Use Tesseract to OCR a multiple page PDF
----------------------------------------

Last edited: $Date: 2018/05/30 18:39:38 $

## Tesseract to OCR

Tesseract is a well-known open source OCR engine with a reputation for good recognition quality.

## ghostscript to convert PDF to tiff

Tesseract requires image files as input, so we must first convert the PDF to a tiff file. Ghostscript can do this and produces high quality images. Its tiffg4 device writes all pages into a single black-and-white, multi-page tiff, which Tesseract can read in one run.

## Converting the PDF to tiff:

    gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
    -sOutputFile=output_filename.tiff input_filename.pdf

## Convert tiff to text with Tesseract

    tesseract output_filename.tiff text_file -l eng

The name text_file is the base name of the output file. Tesseract adds a ".txt" extension to it, so the recognized text ends up in text_file.txt.

$Id: ocrwithtesseract.txt,v 1.3 2018/05/30 18:39:38 matto Exp $
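
The two commands above can be combined into a small shell script. This is only a minimal sketch and not part of the original note: the function name `ocr_pdf` is hypothetical, and naming the text file after the input PDF and deleting the intermediate tiff are assumptions about what is convenient. The gs and tesseract options are the same ones used above.

```shell
#!/bin/sh
# Sketch: OCR one or more PDFs with the two commands from this page.
# ocr_pdf is a hypothetical helper name, not part of Ghostscript or Tesseract.

ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}        # strip the .pdf extension
    tiff=$base.tiff         # intermediate multi-page tiff

    # Step 1: render all pages of the PDF into one tiff file.
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$tiff" "$pdf" || return 1

    # Step 2: OCR the tiff; Tesseract appends ".txt" to the base name.
    tesseract "$tiff" "$base" -l eng || return 1

    # Assumption: the tiff is only needed as an intermediate file.
    rm -f "$tiff"
}

# Process every PDF named on the command line.
for f in "$@"; do
    ocr_pdf "$f" || echo "failed: $f" >&2
done
```

Saved under a name of your choice and called with `input_filename.pdf` as its argument, the script would leave the recognized text in `input_filename.txt`.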