
The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)



TIFFs, PDFs and OCR

Posted on August 16, 2011 by Hugh Paterson III

I have been asked about PDFs and OCR several times. The questions usually come down to the two main questions of this post.

Is OCR built into PDFs, or is there a need for independent OCR?

In particular, is an image-based PDF searchable?

The short answer to the first part is yes: Adobe Acrobat Pro has an OCR function built in. The answer to the second part is no: an image is not searchable. What Adobe Acrobat Pro can do is run OCR on an image, such as a .tiff file, and then add a layer of text (the output of the OCR process) behind the image. When the PDF is searched, the search actually runs against the text layer behind the image and tries to find the match there. The OCR process is usually between 80% and 90% accurate on texts in English. This is usually good enough for finding words or partial words.
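To make the "text layer behind the image" idea concrete, here is a minimal sketch in Python. It is only a toy model, not how Acrobat or the PDF format actually stores anything: the ScannedPage class, the search function, and the file names are all made up for illustration. It shows why a search can succeed even though the image itself is not searchable, and how an OCR mistake affects a full-word search but not necessarily a partial-word one.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    """Toy stand-in for one page of an image-based PDF (not a real PDF API)."""
    image_file: str   # the scan the reader sees, e.g. a .tiff
    text_layer: str   # OCR output stored invisibly behind the image


def search(pages, query):
    """Return the 1-based page numbers whose hidden text layer contains query."""
    needle = query.lower()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if needle in page.text_layer.lower()
    ]


pages = [
    ScannedPage("page-001.tiff", "Language documentation in the field"),
    # Simulated OCR mistake: "documentation" was read as "docurnentation".
    ScannedPage("page-002.tiff", "Archiving language docurnentation"),
]

print(search(pages, "documentation"))  # [1]     (page 2 is missed)
print(search(pages, "language"))       # [1, 2]
print(search(pages, "entation"))       # [1, 2]  (partial word still matches)
```

The search never looks at image_file at all. Only the text layer is consulted, which is why the quality of the OCR output decides what can be found.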

The Data Conversion Laboratory has a nice, detailed write-up on the process of converting from images to text with Adobe Acrobat Pro.

Daily Designer has a tutorial on how to do it on OS X.
David R. Mankin explains on his blog what the process looks like using Windows.

One of the beauties of Adobe Acrobat Pro is that this process can be scripted and the TIFFs processed in batches.
[On Windows] :: [On OS X using AppleScript] :: [Cross platform help from Adobe]
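For a rough picture of what batch processing looks like, here is a sketch in Python. Note that it does not script Adobe Acrobat Pro. Since Acrobat's own scripting is covered by the links above, this sketch swaps in the open source Tesseract command-line tool as a stand-in, and assumes it is installed and on the PATH. The function names and folder names are hypothetical. The point is only the shape of the loop: find every TIFF in a folder, then produce one searchable PDF per TIFF.

```python
from pathlib import Path
import subprocess


def build_ocr_commands(tiff_dir, out_dir, lang="eng"):
    """Build one Tesseract command per TIFF found in tiff_dir (sorted by name)."""
    tiff_dir, out_dir = Path(tiff_dir), Path(out_dir)
    commands = []
    for tiff in sorted(tiff_dir.iterdir()):
        if tiff.suffix.lower() not in {".tif", ".tiff"}:
            continue  # skip anything that is not a TIFF
        out_base = out_dir / tiff.stem  # Tesseract appends ".pdf" itself
        # "pdf" asks Tesseract for a PDF with the image plus a text layer.
        commands.append(["tesseract", str(tiff), str(out_base), "-l", lang, "pdf"])
    return commands


def run_batch(tiff_dir, out_dir, lang="eng", dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_ocr_commands(tiff_dir, out_dir, lang):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling run_batch("scans", "searchable") only prints what would be run. Passing dry_run=False actually runs Tesseract on each file, so it is worth checking the printed commands first.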

The University of Illinois Chicago explains how to use Adobe Acrobat Pro and OCR with a scanner using a TWAIN driver.

The better OCR option

Since I work in an industry where we deal with multiple languages and need to professionally OCR thousands of documents, I thought I would provide a few links comparing the OCR software on the market.
Lifehacker has a short write-up of the top five OCR tools.

Two of those top five, ABBYY FineReader and Adobe Acrobat, are compared side by side in this article on both OS X and Windows.

Are all files used to create an original PDF included in the PDF?

One thing to remember, which I have said before, is that not all PDFs are created equal. This manual talks a bit about the different settings inside of PDFs when using Adobe’s PDF printer.

The short answer is no, but the long answer is yes. Depending on the settings of the PDF creator, the original files might be altered before they are wrapped in a PDF wrapper.

So the objection, usually in the form of a question, sometimes comes up:

Is the PDF file just using the PDF framework as a wrapper around the original content? Therefore, to archive things “properly” do I still need to keep the .tiff images if they are included in the PDF document?

The answer is: “it depends”. It depends on several things, one of which is what program created the PDF and how it created the PDF: did it send the document through PostScript first? Another is what else one might want to do with the .tiff files.

In an archiving mentality, the real question is: “Should the .tiff files also be saved?” The best-practice answer is yes. The reason is that the PDF is viewed as a presentation version and the .tiff files are viewed as the digital “originals”.

Posted in Access, Digital Archival, Images, Meta-data, Running On OS X | Tagged Adobe Acrobat Pro, archival, OCR, PDF


One should not consider the content on this website to be an official opinion of any company associated with me. These posts are solely my opinion.
