SWMBO has a pile of PDF documents to process and extract information from, and over 50 of them are scanned, which means NO COPY/PASTE! Unless we run them through OCR, of course. On Windows, she’d probably just use Acrobat, but on Linux…
SWMBO and I both use Fedora Linux, because we mostly work building WordPress and other PHP/MySQL website solutions, and Linux seems a natural fit. So I set out to find the best and easiest approach to running OCR on PDFs. I found a rather good article on the Ubuntu Community Help Wiki — OCR – Optical Character Recognition — which provides a few good options.
I took a quick look at gscan2pdf since it sounded promising: a simple GUI tool that SWMBO could use to run OCR on a PDF, just the ticket. Except that the results are pretty awful and disjointed. What it gives you is a bunch of disparate images, each with spotty OCR text output.
Tesseract gets the best rap as a command line tool, but it spits out plain text files. That’s workable, but it means switching between the PDF and the text file to find the OCR’d text associated with a page, which can be confusing and tedious.
So I turned to pdfocr, a nice little Ruby script that automates the conversion using tesseract (amongst other options). It takes the PDF document, extracts the scanned images, processes each with tesseract, and pieces it all back together again as a PDF. You get to look at the original scanned document and select the OCR’d text from it, just as you would in Acrobat.
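For the curious, the extract, OCR and reassemble steps can be sketched by hand for a single file. This is not pdfocr's actual code, just an illustration of the idea. The function name ocr_pdf is made up here, and it assumes pdftoppm, tesseract, hocr2pdf and pdftk are all installed, plus a tesseract release that writes its hOCR output to a .hocr file (older releases use .html instead).

```shell
# Sketch of the extract / OCR / reassemble steps; not pdfocr's actual code.
# ocr_pdf is a hypothetical name. Assumes pdftoppm, tesseract, hocr2pdf
# and pdftk are installed, and a tesseract that writes <base>.hocr
# (older releases write <base>.html instead).
ocr_pdf() {
    local src="$1" dst="$2" work img base
    work=$(mktemp -d) || return 1
    # 1. Render each page of the scan to a PNG: page-1.png, page-2.png, ...
    pdftoppm -r 300 -png "$src" "$work/page"
    for img in "$work"/page-*.png; do
        base="${img%.png}"
        # 2. OCR the page; the hOCR output records where each word sits.
        tesseract "$img" "$base" hocr
        # 3. Lay that text over the page image as a one-page PDF.
        hocr2pdf -i "$img" -o "$base.pdf" < "$base.hocr"
    done
    # 4. Join the per-page PDFs back into a single document.
    pdftk "$work"/page-*.pdf cat output "$dst"
    rm -rf "$work"
}
```

Something like `ocr_pdf scan.pdf searchable.pdf` would then produce the selectable-text version, though pdfocr handles the details far more carefully.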
The trick to getting pdfocr to work on Fedora is to obtain a copy of exact-image, a package that contains a bunch of handy image and file processing utilities including hocr2pdf. The easiest way to get this on Fedora is to grab a copy off the RPM Search at rpm.pbone.net.
With exact-image installed, I downloaded pdfocr and copied the .rb file into /usr/local/bin, the .1 (manual page) file into /usr/local/share/man/man1, and was all set to go. To automate OCR scanning of those 50+ PDF files I just needed a quick command line driver. I can never remember how to use xargs properly, so I wrote this simple sed script.
mkdir -p processed; ls *.pdf | sed -rn 's|(.*)|pdfocr.rb -t -i "\1" -o "processed/\1"|p' | sh
The job runs on its own and leaves a folder full of PDFs for SWMBO to work on. I love open source :)
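One caveat with the sed approach: it pastes each filename straight into shell code, so a name containing a double quote, a dollar sign or a backtick will break the generated command. A plain loop sidesteps that. Here is a minimal sketch; ocr_all and the PDFOCR variable are names invented for this example, and the pdfocr flags are simply copied from the one-liner.

```shell
# Sketch: OCR every PDF in the current directory into an output folder.
# ocr_all and PDFOCR are hypothetical names, not part of pdfocr.

# Command to run for each file; override it to try the loop without pdfocr.
PDFOCR="${PDFOCR:-pdfocr.rb}"

ocr_all() {
    local outdir="${1:-processed}"
    local f
    mkdir -p "$outdir"
    for f in *.pdf; do
        # With no matches the glob stays literal, so skip it.
        [ -e "$f" ] || continue
        # Flags copied from the sed one-liner: -t, -i input, -o output.
        "$PDFOCR" -t -i "$f" -o "$outdir/$f" || echo "failed: $f" >&2
    done
}
```

Calling `ocr_all` in the folder of scans writes the results to processed/, and a failure on one file is reported without stopping the rest.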
Edit: I just had to set this up again, on Ubuntu 14; here’s what I needed to install with apt-get, in addition to pdfocr:
- exactimage
- pdftk
- poppler-utils
- tesseract-ocr
- tesseract-ocr-eng (or whichever languages you need)
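Before running a big batch, it can be worth checking that the tools those packages provide are actually on the PATH. A small sketch follows; missing_tools is a made-up helper name, and the list of tool names passed to it is my assumption about what pdfocr calls (hocr2pdf comes from exactimage, pdftoppm is one of the poppler-utils programs).

```shell
# Sketch: report which of the named tools are missing from PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Print the missing names, trimmed of the leading space.
    echo "${missing# }"
}
```

For example, `missing_tools hocr2pdf pdftk pdftoppm tesseract` prints an empty line when everything is installed, and the names of the absent tools otherwise.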