[HN Gopher] Pdfgrep - a commandline utility to search text in PD...
       ___________________________________________________________________
        
       Pdfgrep - a commandline utility to search text in PDF files
        
       Author : kretaceous
       Score  : 187 points
       Date   : 2022-09-25 14:38 UTC (8 hours ago)
        
 (HTM) web link (pdfgrep.org)
 (TXT) w3m dump (pdfgrep.org)
        
       | neilv wrote:
       | I've long used `pdfgrep` in a very kludgey way, when stockpiling
       | rare variants of particular used ThinkPad models (for looking up
       | _actual_ specs based on an IBM  "type" number shown in a photo on
       | an eBay listing, since the seller's listing of specs is often
       | incorrect).
       | 
        | Example shell function:
        | 
        |     t500grep() {
        |         pdfgrep "$1" /home/user/doc/lenovo-psref-withdrawn-thinkpad-2005-to-2013-2013-12-447.pdf
        |     }
       | 
        | Example run:
        | 
        |     $ t500grep 2082-3GU
        |     2082-3GU  T9400  2.53  2GB  15.4" WSXGA+  Cam  GMA, HD 3650
        |       160G  7200  DVD+-RW  Intel 5100  2G Turbo  9  Bus 32  Aug 08
       | 
       | The Lenovo services to look up this info come and go, and are
       | also slow, but a saved copy of the data lives forever.
       | 
        | (A non/less-kludgey way would be to get the information from all
        | my IBM/Lenovo PSREFs into a lovingly-engineered
        | database/knowledge schema, simple CSV file, or `grep`-able ad hoc
        | text file.)
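        | 
        | A sketch of that grep-able text file approach, assuming Poppler's
        | pdftotext is installed (all paths here are hypothetical):

```shell
# Concatenate the text of several PDFs into one grep-able file.
# Paths are placeholders; pdftotext comes from the Poppler utilities.
build_psref_index() {
    out="$1"; shift
    : > "$out"                          # truncate/create the index file
    for f in "$@"; do
        pdftotext -layout "$f" - >> "$out"
    done
}

# Usage:
#   build_psref_index ~/doc/psref-all.txt ~/doc/psref/*.pdf
#   grep 2082-3GU ~/doc/psref-all.txt
```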
        
       | mistrial9 wrote:
       | +1 -- after trying several tool sets over years, _pdfgrep_ is
       | currently used daily around here
        
       | donio wrote:
       | For Emacs users there is also https://github.com/jeremy-
       | compostella/pdfgrep which lets you browse the results and open
       | the original docs highlighting the selected match.
       | 
       | It's on MELPA.
        
         | nanna wrote:
          | Tried this the other day but couldn't figure out how to use it.
          | Is it invoked from eshell, dired, a pdf-tools buffer, or what?
        
       | radicalbyte wrote:
        | Cool. About 15 years ago I built something similar for PDF,
        | Office (OpenXML), and plain text as part of a search engine.
        | Commercial/closed source, of course, but it was super handy.
        
       | majkinetor wrote:
       | For Windows, there is dngrep: http://dngrep.github.io
        
         | gabythenerd wrote:
          | Love dngrep. One of my friends used to combine all his PDFs
          | into one and search with Adobe Reader, before I showed this to
          | him. It's very powerful and also simple to use, even for
          | non-technical users.
        
       | moonshotideas wrote:
       | Out of curiosity, how did you solve the issue of extracting text
       | from the pdf, error free? Or did you use another package?
        
         | kfarnung wrote:
         | Looking at the list of dependencies, it seems like they use
         | poppler-cpp to render the PDFs.
         | 
         | https://gitlab.com/pdfgrep/pdfgrep#dependencies
        
           | dmoo wrote:
            | Poppler's pdftotext -layout is great
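            | 
            | A minimal pipeline sketch along those lines (the filename is
            | hypothetical):

```shell
# Extract text with the column layout preserved ("-" means stdout),
# then search it case-insensitively with line numbers.
pdf_search() {
    pdftotext -layout "$1" - | grep -in -- "$2"
}

# Usage: pdf_search report.pdf invoice
```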
        
         | Frost1x wrote:
          | Curious as well. About a year ago I was implementing what I
          | naively thought would be a simple verification that a specific
          | string existed (case-sensitive or -insensitive) within a PDF's
          | text, and hit many cases where text was clearly rendered in the
          | document but many libraries couldn't identify it. My
          | understanding, after going down the rabbit hole, is that
          | there's a lot of variance in how a rendered PDF represents what
          | one might assume is a simple string (which wasn't too
          | surprising, since I don't like to make simplicity
          | assumptions). I couldn't find anything at the time that seemed
          | to be error-free.
         | 
          | Aside from applying document rendering with OCR and text
          | recognition, I ended up living with some error rate there. I
          | think pdfgrep was one of the tools I tested. Some other people
          | just used libraries/tools as-is with no QA, but in my sample of
          | several hundred verified documents, pdfgrep (and others) missed
          | some.
        
       | nip wrote:
       | Tangential:
       | 
       | Some time ago I built an automation [1] that identifies whether
       | the given PDFs contain the specified keywords, outputting the
       | result as a CSV file.
       | 
        | Similar to pdfgrep, probably much slower, but potentially more
        | convenient for people who prefer GUIs.
       | 
       | [1] https://github.com/bendersej/pdf-keywords-extractor
        
       | mistermann wrote:
       | A bit of a tangent, but does anyone know of a good utility that
       | can index a large number of PDF files so one can do fast keyword
       | searches across all of them simultaneously (free or paid)? It
       | seems like this sort of utility used to be very common 15 years
       | ago, but local search has kind of died on the vine.
        
         | [deleted]
        
         | kranner wrote:
         | DEVONsphere Express, Recoll, also the latest major version of
         | Calibre.
        
         | crtxcr wrote:
          | I am working on looqs, which can do that (and will also render
          | the page immediately): https://github.com/quitesimpleorg/looqs
        
         | donio wrote:
         | Recoll is a nice one, uses Xapian for the index.
         | 
         | https://www.lesbonscomptes.com/recoll/
        
         | summm wrote:
         | Recoll?
        
           | tombrossman wrote:
           | Yes, +1 for Recoll. It can also OCR those PDFs that are just
           | an image of a page of text, and not 'live' text. Read the
           | install notes and install the helper applications.
           | 
           | When searching I'll first try the application or system's
           | native search utility, but most of the time I end up opening
           | Recoll to actually find the thing or snippet of text I want,
           | and it has never failed me.
           | 
           | https://www.lesbonscomptes.com/recoll/pages/features.html#do.
           | ..
        
         | llanowarelves wrote:
         | dtSearch
        
         | sumnole wrote:
         | While we're asking for tool tips: does anyone know of a tool
         | that will cache/index web pages as the user browses, so that it
         | can be searched/viewed offline later?
        
         | pletnes wrote:
         | Macos' spotlight can do this AFAIK.
        
           | pulvinar wrote:
            | Yes, and Spotlight's also usable from the command line as
            | mdfind, which has an -onlyin switch to restrict the search to
            | a directory.
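            | 
            | For example, restricted to one directory (the query and path
            | are hypothetical):

```shell
# macOS only: query Spotlight's index from the shell. -onlyin limits
# the search to the given directory; "kind:pdf" filters to PDFs.
spotlight_pdfs() {
    mdfind -onlyin "$1" "kind:pdf $2"
}

# Usage: spotlight_pdfs ~/Documents thinkpad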
        
       | marttt wrote:
       | I've been using Ali G. Rudi's pdftxt with my own shell wrappers.
       | From the homepage: "uses mupdf to extract text from pdf files;
       | prefixes each line with its page number for searching."
       | 
       | Usually I 1) pdftxt a file and 2) based on the results, jump to a
       | desired page in Rudi's framebuffer PDF reader, fbpdf. For this,
       | the page number prefix in pdftxt is a particularly nice default.
       | No temptations with too many command line options either.
       | 
       | https://litcave.rudi.ir/pdftxt-0.7.tar.gz
        
       | thriftwy wrote:
        | The catdoc utility does the same for .doc MS Word files. Maybe
        | for PDFs also.
        
       | sauercrowd wrote:
        | pdfgrep is great. Worked like a charm to diff updates to a
        | contract.
        
       | Findecanor wrote:
       | Just what I needed to search my collection of comp-sci articles.
       | Regular grep fails on most PDFs.
       | 
       | I installed the Ubuntu package. Thanks!
        
       | ankrgyl wrote:
       | DocQuery (https://github.com/impira/docquery), a project I work
       | on, allows you to do something similar, but search over semantic
       | information in the PDF files (using a large language model that
       | is pre-trained to query business documents).
       | 
        | For example:
        | 
        |     $ docquery scan "What is the due date?" /my/invoices/
        |     /my/invoices/Order1.pdf
        |     What is the due date?: 4/27/2022
        |     /my/invoices/Order2.pdf
        |     What is the due date?: 9/26/2022
        |     ...
       | 
       | It's obviously a lot slower than "grepping", but very powerful.
        
         | pugio wrote:
          | Wow, this is exactly what I've been looking for, thank you! I
          | just wish it were possible with these transformer models to
          | extract a structured set of what the model "knows" (e.g. for
          | easy search indexing). These natural language question systems
          | are a little too fuzzy sometimes.
        
           | a1369209993 wrote:
            | > to extract a structured set of what the model "knows"
            | 
            | To be fair, that's impossible in the general case, since the
            | model can know things (i.e. be able to answer queries)
            | without knowing that it knows them (i.e. being able to
            | produce a list of answerable queries by any means
            | significantly more efficient than trying every query and
            | seeing which ones work).
           | 
            | As a reductio ad absurdum example, consider a 'model'
            | consisting of a deniably encrypted key-value store, where
            | it's outright cryptographically guaranteed that you can't
            | efficiently enumerate the queries. Neural networks aren't
            | _quite_ that bad, but (in the general-over-NNs case) they at
            | least superficially _appear_ to be pretty close. (They're
            | definitely not _reliably_ secure, though; don't depend on
            | that.)
        
           | ankrgyl wrote:
           | Can you tell me a bit more about your use case? A few things
           | that come to mind:
           | 
            | - There are some ML/transformer-based methods for extracting
            |   a known schema (e.g. NER) or an unknown schema (e.g.
            |   relation extraction).
            | 
            | - We're going to add a feature to DocQuery called "templates"
            |   soon for some popular document types (e.g. invoices), plus
            |   a document classifier which will automatically apply the
            |   template based on the doc type.
            | 
            | - Our commercial product (http://impira.com/) supports all of
            |   this and is a hosted solution (many of our customers use us
            |   to automate accounts payable, process insurance documents,
            |   etc.)
        
             | mbb70 wrote:
             | Since you mention insurance documents, could you speak to
             | how well this would extract data from a policy document
             | like https://ahca.myflorida.com/medicaid/Prescribed_Drug/dr
             | ug_cri... ?
             | 
             | The unstoppable administrative engine that is the American
             | Healthcare system produces hundreds of thousands of
             | continuously updated documents like this with no
             | standardized format/structure.
             | 
              | Manually extracting/normalizing this data into a queryable
              | format is an industry all its own.
        
               | ankrgyl wrote:
               | It's very easy to try! Just plug that URL here:
               | https://huggingface.co/spaces/impira/docquery.
               | 
                | I tried a few questions:
                | 
                |     What is the development date? -> June 20, 2017
                |     What is the medicine? -> SPINRAZA(r) (nusinersen)
                |     How many doses -> 5 doses
                |     Did the patient meet the review criteria? ->
                |       Patient met initial review criteria.
                |     Is the patient treated with Evrysdi? -> not
        
             | pugio wrote:
              | Your commercial product looks very cool, but my use case is
              | creating an offline-first local document storage system
              | (data never reaches a cloud). I'd like to enable users to
              | search through all documents for relevant pieces of
              | information.
             | 
              | The templates sound very cool - are they essentially just a
              | preset list of (natural language) queries tied to a
              | particular document class? It seems like you're using a
              | version of Donut for your document classification?
        
               | ankrgyl wrote:
               | > but my use case is in creating an offline-first local
               | document storage system (data never reaches a cloud).
               | 
               | Makes sense -- this is why we OSS'd DocQuery :)
               | 
               | > The templates sound very cool - are they essentially
               | just using a preset list of (natural language) queries
               | tied to a particular document class? It seems like you're
               | using a version of donut for your document
               | classification?
               | 
                | Yes, that's the plan. We've done extensive testing with
                | other approaches (e.g. NER) and realized that the
                | benefits of use-case-specific queries (customizability,
                | accuracy, flexibility for many use cases) outweigh the
                | tradeoffs (NER only needs one execution for all fields).
               | 
               | Currently, we support pre-trained Donut models for both
               | querying and classification. You can play with it by
               | adding the --classify flag to `docquery scan`. We're
               | releasing some new stuff soon that should be faster and
               | more accurate.
        
               | pugio wrote:
                | Sweet! I'll keep an eye on the repo. Thank you for open-
                | sourcing DocQuery. I agree with your reasoning: my
                | current attempts to find an NER model that covers all my
                | use cases have come up short.
        
         | ultrasounder wrote:
          | This is so epic. I was just ruminating about this particular
          | use case. Who are your typical customers: supply chain or
          | purchasing? Also, I notice that you do text extraction from
          | invoices. Are you using something similar to Chargrid or its
          | derivative BERTgrid? Wishing you and your team more success!
        
           | ankrgyl wrote:
            | Thank you ultrasounder! Supply chain, construction,
            | purchasing, insurance, financial services, and healthcare are
            | our biggest verticals, though we have customers doing just
            | about anything you can imagine with documents!
           | 
           | For invoices, we have a pre-trained model (demo here:
           | https://huggingface.co/spaces/impira/invoices) that is pretty
           | good at most fields, but within our product, it will
           | automatically learn about your formats as you upload
           | documents and confirm/correct predictions. The pre-trained
           | model is based on LayoutLM and the additional learning we do
           | uses a generative model (GMMs) that can learn from as little
           | as one example.
           | 
           | LMK if you have any other questions.
        
       | greggsy wrote:
       | I've been looking for an alternative to Acrobat's 'Advanced
       | Search' capability, which allows you to define a search directory
       | (like NP++). This feature is possible with pdfgrep and other
       | tools, but the killer for me is that it displays results in
       | context, and allows you to quickly sift through dozens or
       | hundreds of results with ease.
       | 
        | It's literally the only reason I have Acrobat installed.
        
       | asicsp wrote:
       | See also https://github.com/phiresky/ripgrep-all (`ripgrep`, but
       | also search in PDFs, E-Books, Office documents, zip, tar.gz,
       | etc.)
        
         | xenodium wrote:
         | Was already a big fan of ripgrep and pdfgrep. rga (ripgrep-all)
         | was such a hidden gem for me.
        
       | MisterSandman wrote:
        | Used this for online quizzes for one of my courses. It was pretty
        | good, but using the terminal for stuff like this still sucks
        | since you can't just click into the PDF.
        | 
        | I wish PDFs had a Ctrl-Shift-F search like VS Code that could
        | search for text across multiple PDFs in a directory.
        
         | darkteflon wrote:
         | I use Houdahspot on MacOS for this. You can either invoke it
         | globally with a custom hotkey (I use hyper 's', for "search"),
         | or from within a specific directory via an icon it adds to the
         | top right corner of Finder.
        
         | greggsy wrote:
         | Advanced Search in Acrobat allows you to define a directory.
         | It's the only reason I ever have it installed.
        
       ___________________________________________________________________
       (page generated 2022-09-25 23:00 UTC)