10

Open source preferred, but not necessary.

I've got Adobe Acrobat 8, and really like the OCR feature which can essentially put an invisible layer of OCR'd text on top of a scanned document. Thus what you see on screen is the original scanned document, but the result is searchable.

What I'm looking for is a way to automate this process. I've currently got a few scripts that we use for processing and archiving scanned files, and am looking for something that I can plug right in to this batch process to do OCR in a manner similar to what I can do with Acrobat.

All suggestions welcome, thanks!

HopelessN00b
Boden

3 Answers

8

I have this implemented in a company document archiving project. Each scanned page is a single-page TIFF file. I use Cuneiform to create an hOCR file from each TIFF, then hocr2pdf to output the PDF. If a scan has multiple pages, I use gs to combine the per-page PDFs into a single PDF document. It works really well: the OCR is good enough for our needs, and the result is searchable in any PDF viewer.
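The steps above could be wired into a batch script roughly like this. It is only a sketch, not my exact production setup: the function names and file naming are made up for illustration, and it assumes `cuneiform`, `hocr2pdf` (from ExactImage) and `gs` are on the PATH, with hocr2pdf reading the hOCR on standard input.

```python
import subprocess
from pathlib import Path


def cuneiform_cmd(tif: Path, hocr: Path) -> list:
    # OCR one single-page TIFF and write the recognized text as hOCR.
    return ["cuneiform", "-f", "hocr", "-o", str(hocr), str(tif)]


def hocr2pdf_cmd(tif: Path, pdf: Path) -> list:
    # hocr2pdf takes the page image via -i and expects the hOCR on stdin.
    return ["hocr2pdf", "-i", str(tif), "-o", str(pdf)]


def gs_merge_cmd(pdfs: list, combined: Path) -> list:
    # Concatenate the per-page PDFs with Ghostscript's pdfwrite device.
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={combined}",
        *[str(p) for p in pdfs],
    ]


def ocr_pages(tifs: list, combined: Path) -> Path:
    """Turn single-page TIFFs into one searchable PDF (hypothetical helper)."""
    page_pdfs = []
    for tif in tifs:
        hocr = tif.with_suffix(".hocr")
        pdf = tif.with_suffix(".pdf")
        subprocess.run(cuneiform_cmd(tif, hocr), check=True)
        # Feed the hOCR file to hocr2pdf on standard input.
        with hocr.open("rb") as hocr_file:
            subprocess.run(hocr2pdf_cmd(tif, pdf), stdin=hocr_file, check=True)
        page_pdfs.append(pdf)
    subprocess.run(gs_merge_cmd(page_pdfs, combined), check=True)
    return combined
```

Calling something like `ocr_pages(sorted(Path("scans").glob("*.tif")), Path("archive.pdf"))` would process a folder of pages in filename order; `check=True` makes the batch stop on the first tool that fails instead of silently producing a partial PDF.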

xeon
1

Have you looked at WatchOCR? You can download it from http://www.watchocr.com. It is a free and open source OCR server that transforms image-only PDFs into text-searchable PDFs from a watched folder or network share.

0

I like the sound of xeon's answer, though OCRopus sounds like a lot of fun.

Kara Marfia