Over the years I have been asked about PDFs and OCR several times. The questions usually come down to the two main questions of this post.
So, is OCR built into PDFs, or is there a need for independent OCR?
In particular, is an image-based PDF searchable?
The short answer is yes, in the sense that Adobe Acrobat Pro has an OCR function built in. And to the second part: no, an image by itself is not searchable. What can happen is that Adobe Acrobat Pro runs OCR on an image, such as a .tiff file, and then adds a layer of text (the output of the OCR process) behind the image. When the PDF is searched, the search actually runs against the text layer behind the image and tries to find the match there. The OCR process is usually between 80% and 90% accurate on texts in English. This is usually good enough for finding words or partial words.
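To make the accuracy figure a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes the 80% to 90% figure is a per-character accuracy and that recognition errors are independent of each other; both are simplifying assumptions of mine, and the function name is just illustrative. Under those assumptions, the chance that a run of characters comes through the OCR intact drops as the run gets longer, which is one reason searching for a partial word can succeed where the full word fails.

```python
def exact_match_probability(char_accuracy: float, length: int) -> float:
    """Chance that `length` consecutive characters were all recognized
    correctly, assuming independent per-character errors (a simplification)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return char_accuracy ** length


if __name__ == "__main__":
    # Compare short fragments with longer words at both ends of the range.
    for accuracy in (0.80, 0.90):
        for length in (3, 5, 8):
            p = exact_match_probability(accuracy, length)
            print(f"accuracy={accuracy:.2f} length={length} -> {p:.2f}")
```

With these assumptions a five-letter word at 90% accuracy survives intact about 59% of the time, while a three-letter fragment of it does noticeably better. Real OCR errors tend to cluster on poor scans and unusual fonts, so treat this as intuition rather than a prediction.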
The Data Conversion Laboratory has a really nice and detailed write-up on the process of converting from images to text with Adobe Acrobat Pro.
Daily Designer has a tutorial on how to do it on OS X.
David R. Mankin explains on his blog what the process looks like using Windows.
One of the beauties of Adobe Acrobat Pro is that this process can be scripted and the TIFFs processed in batches.
[On Windows] :: [On OS X using AppleScript] :: [Cross platform help from Adobe]
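The linked resources cover the Acrobat-specific scripting. As a generic illustration of the batching idea only, here is a minimal Python sketch that collects the TIFFs in a folder and hands them to an OCR step in fixed-size batches. The OCR step itself is deliberately left as a function you pass in; `find_tiffs`, `batches`, and `process_folder` are hypothetical names of mine, not part of any Adobe tooling.

```python
from pathlib import Path
from typing import Callable, Iterator, List


def find_tiffs(folder: Path) -> List[Path]:
    """Return the .tif/.tiff files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    )


def batches(items: List[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of `items` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_folder(folder: Path, ocr_step: Callable[[Path], None],
                   size: int = 50) -> int:
    """Run `ocr_step` on every TIFF, batch by batch; return how many ran."""
    count = 0
    for batch in batches(find_tiffs(folder), size):
        for tiff in batch:
            ocr_step(tiff)  # placeholder: call whatever OCR tool you use here
            count += 1
    return count
```

For a dry run, `process_folder(Path("scans"), print, size=10)` would simply print each file name, assuming a folder called `scans` exists.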
University of Illinois Chicago explains how to use Adobe Acrobat Pro and OCR with a scanner using a TWAIN driver.
The better OCR option
Since I work in an industry where we deal with multiple languages and need to professionally OCR thousands of documents, I thought I would provide a few links comparing the OCR software on the market.
Lifehacker has a short write-up of the top five OCR tools.
Two of those top five, ABBYY FineReader and Adobe Acrobat, are compared side by side on both OS X and Windows in this article.
Are all files used to create an original PDF included in the PDF?
One thing to remember, which I have said before, is that not all PDFs are created equal. This manual talks a bit about the different settings available when using Adobe’s PDF printer.
The short answer is no, but the long answer is yes. Depending on the settings of the PDF creator, the original files might be altered (for example, images can be downsampled or recompressed) before they are wrapped in a PDF wrapper.
So an objection, usually in the form of a question, sometimes comes up:
Is the PDF file just using the PDF framework as a wrapper around the original content? Therefore, to archive things “properly” do I still need to keep the .tiff images if they are included in the PDF document?
The answer is: “it depends”. It depends on several things. One is which program created the PDF and how it created it: did it send the document through PostScript first? Another is what else one might want to do with the .tiff files.
From an archiving perspective, the real question is: “Should the .tiff files also be saved?” The best-practice answer is yes. The reason is that the PDF is viewed as a presentation version and the .tiff files are viewed as the digital “originals”.
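If you want a rough idea of how a particular PDF stores its page images, and therefore whether the TIFF data may have been recompressed on the way in, one option is to look at the compression filter declared on each embedded image. The sketch below is a heuristic of mine that scans the raw bytes of an unencrypted PDF with regular expressions; it is not a real PDF parser, the function name is hypothetical, and the labels are my own shorthand for the filter names defined in the PDF specification.

```python
import re
from collections import Counter

# Stream filter names from the PDF specification, with informal labels.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG (lossy)",
    b"JPXDecode": "JPEG 2000",
    b"CCITTFaxDecode": "CCITT fax (bitonal, lossless)",
    b"JBIG2Decode": "JBIG2 (bitonal)",
    b"FlateDecode": "Flate/zlib (lossless)",
    b"LZWDecode": "LZW (lossless)",
}

_OBJECT = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
_IMAGE = re.compile(rb"/Subtype\s*/Image")
_FILTER = re.compile(rb"/Filter\s*\[?\s*/(\w+)")


def image_filters(pdf_bytes: bytes) -> Counter:
    """Count embedded images by the first filter named in their dictionary."""
    counts = Counter()
    for match in _OBJECT.finditer(pdf_bytes):
        # Only inspect the dictionary part, not the binary stream data.
        header = match.group(1).split(b"stream", 1)[0]
        if not _IMAGE.search(header):
            continue
        found = _FILTER.search(header)
        name = found.group(1) if found else None
        counts[IMAGE_FILTERS.get(name, "uncompressed or unknown")] += 1
    return counts
```

Calling `image_filters(open("scan.pdf", "rb").read())` on a file of your own (the name here is only an example) gives a tally such as how many images are stored as JPEG. A lossy filter is a hint, not proof, that the pixels differ from the source TIFFs, which is one more reason to keep the TIFFs.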