Tutorials for ExifTool

This looks awesome. I'll have to remember this for those situations where I am looking to embed metadata.

http://www.avpreserve.com/exiftool-tutorial-series/

Adding white space to PDF after cropping

Posted on November 1, 2013 by Hugh Paterson III

I have a PDF that I would like to crop to text and then add consistent white space (margin). The PDF was generated by a Bookeye 4 scanner, which exported the content straight to PDF. So, I am trying to do this with Adobe Acrobat 9.2. SIL Americas Area Publishing suggested that I use ScanTailor, an excellent program, but one which I find crashes on OS X.

Teaching FLEx in Malaysia

Posted on October 31, 2011 by Hugh Paterson III

In October, Becky and I were invited to present FLEx at Universiti Malaysia Sabah as part of a workshop for compiling native dictionaries and managing cultural data. I learned a lot about dictionaries, about using FLEx to organize dictionary data, about Webonary and about Malaysia.

One of the things this workshop helped me to clearly articulate was that there are four knowledge content areas which dictionary creators need:

Knowledge about Theoretical Linguistics to understand the language being described and the categories possible in the dictionary.
Knowledge about the language being analyzed and described so that they can apply the appropriate options available to this situation.
Knowledge about how to manage the editorial process for the dictionary (including entry submission).
Knowledge about how to use the software to implement the editorial process.

This workshop’s focus was only on the software used to implement the editorial process (mostly the data collection part of the editorial process). So in some ways it felt like we weren’t giving the participants all the tools they will need (or even showing them all the tools they will need). But we had to realize that it is not our responsibility to give them all the tools they need or to expose them to these issues. They need local contacts for that. Regardless of these issues, we were still ecstatic that there were about 80 people in attendance.

About 80 people

Opening Ceremonies at UMS

Becky took most of the sessions on FLEx. She presented on using FLEx as a tool for collecting words and various things about words. We covered several input methods and features in the application.

Becky talking about FLEx as a tool

Becky helping people doing exercises

I presented a session explaining how to get data out of FLEx. We talked about putting dictionary data on the web and turning it into .epub files.

Hugh presenting on getting things out of FLEx

I think one of the more interesting things that I learned was about expectations, culture and photographs.
Many people wanted photographs with us (or of us). This is not totally unexpected. What was unexpected was that rather than taking one photo and sharing it (passing it around), everyone wanted their own picture. Not just their own picture with us, but a picture with us made with their own camera! It was in that moment that I had an epiphany. Having training in Language Documentation, I am aware of and concerned with rules and laws concerning privacy. In the U.S., when dealing with issues of informed consent and intellectual property, it cannot be assumed that if I take a picture of you then I, the owner of the camera, own the picture. Furthermore, it cannot be assumed that I have the right to do with that picture as I please (e.g., post it to the internet). This may be in part because our laws are based on our semantics. It may be in part our culture. But there I realized that if the photo is taken with your camera, you own the photo. You can do with it as you please. Asking for permission means asking for permission to take the photo.

Taking our picture

Taking their picture, while they were taking a picture of us. Since he who owns the camera, owns the picture...

I took this last picture at about the same time I had the epiphany.

The iPad Team

Posted on September 16, 2011 by Hugh Paterson III

The Concept

I have had some ideas I wanted to try out for using the iPad as a tool for collecting photo metadata. Working in a corporate archive, I have become aware of several collections of photos without much metadata associated with them.

The photos are the property of (or are in the custodial care of) the company I work at (in their corporate archive).

The subjects of the photos fall into two general areas:

The minority language speaking people groups that the employees of a particular company worked with, including anthropological topics like ways of life, etc.
Photos of operational events significant to telling the story of the company holding the photos.

Archives in more modern contexts are trying to show their relevance not only to academics, but also to general members of communities. In the United States there is a whole movement of social history. There are community preservation societies which take on the task of collecting old photographs and their stories, and preserving and presenting them for future generations.

The challenge at hand is: "How do we enrich photos by adding metadata to photos in the collections of archives?" There are many solutions to this kind of task. The refining, distilling, and reduction of stories and memories to writing and even to metadata fields is no easy task, nor is it a task that one person can do on their own. One solution, which is often employed by community historians, is the personal interview. Interviewing the photographers or people who were at an event, and asking them questions about a series of photos, creates an atmosphere of inquisitiveness, one where the story-teller is valued because they have a story-listener. This basic personal connection allows for interactions to occur across generational and technological barriers.

The crucial question is: "How do we facilitate an interaction which is positive for all the parties involved?" The effort and thinking behind answering this question has more to do with shaping human interactions than with anything else. We are also talking about using technology in this interaction. This is true UX (User Experience).

Past Experience

This past summer I had several experiences with facilitating one-on-one interactions between knowledgeable parties working with photographs and someone acting on behalf of the corporate archive. To facilitate this interaction a GoogleDoc Spreadsheet was set up and the person acting on behalf of the archive was granted access to the spreadsheet. The individual conducting the interview and listening to the stories brought their own netbook (small laptop) from which to enter any collected data. They were also given a photo album full of photos, which the interviewee would look through. This set-up required overcoming several local environmental challenges. As discussed below, some of these challenges were better addressed than others.

Association of Data to a Given Photo

The challenge was keeping up to 150 photos organized during an interview so that metadata about any given photo could be collected and associated with only that photo. This was addressed by adhering an inventory sticker to the back of each photo and assigning each photo a single row in the GoogleDoc Spreadsheet. Using GoogleDocs was not the ideal solution, but rather a solution of some compromises:

Strengths of GoogleDocs

One of the great things about GoogleDocs is that the capability exists for multiple people to edit the spreadsheet simultaneously.
Another strength of GoogleDocs is that there is a side bar chat feature, so that if there is a question during the interview, help can be had very quickly from management (me, who was offsite).
The Data can be exported in the following formats: .xlsx , .xls , .csv , .pdf.
There was no cost to deploy the technology.
It is accessible through a web-browser in an OS neutral manner.
The document is available wherever the internet is available.
A single solution could be deployed and used by people digitizing photos, recording written metadata on the photos, and gathering metadata during an interview.
Most people acting on behalf of the archive were familiar with the technology.

Pitfalls of GoogleDocs

More columns exist in the spreadsheet than can be practically managed. (The columns are presented below in a table.) There are about 48 values in a record and there are about 40,000 records.

More columns than can be practically managed

Does not display the various levels of data as levels in the user interface.
Cannot remove unnecessary fields from the UI of various people. (No role-based support.)
Only available when there is internet.

Maximizing of Interview Time

To maximize time spent with the interviewee, the photos and any metadata written or known about a photo were put into the GoogleDoc Spreadsheet prior to the interview. Sometimes this was not done by the interviewer but rather by someone else working on behalf of the archive. During the interview the interviewer could tell which data fields were empty by looking for the gray cells in the spreadsheet. However, just because the cells were gray did not mean that the interviewee was more prone to provide the desired, unknown, information.
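The same check for empty fields could also be run on an exported .csv copy of the spreadsheet before an interview. The following is only a minimal sketch: the column names and the sample rows are made up for illustration, and `empty_fields` is a hypothetical helper, not something that was part of the original set-up.

```python
import csv
import io


def empty_fields(csv_text, id_column="Photo Number"):
    """Map each photo ID to the column names that still have no value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {}
    for row in reader:
        blanks = [
            name
            for name, value in row.items()
            # Skip the ID column itself and any overflow cells (key None).
            if name is not None
            and name != id_column
            and not (value or "").strip()
        ]
        if blanks:
            missing[row[id_column]] = blanks
    return missing


# Made-up sample rows standing in for a .csv export of the spreadsheet.
SAMPLE = """Photo Number,Photographer,Country,Description
A-001,J. Smith,,Village market
A-002,,,
A-003,J. Smith,Mexico,Boat launch
"""

if __name__ == "__main__":
    for photo_id, blanks in empty_fields(SAMPLE).items():
        print(photo_id, "is missing:", ", ".join(blanks))
```

A list like this could be printed and taken to the interview, so the interviewer knows which questions still need asking for each photo.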

Grey Areas Show Metadata fields which are empty

Data Input Challenges

One unanticipated challenge encountered in the interviews was that when the interviewer brought out an album or two of photos, the interviewees would cover more photos than the interviewer could record.

Let me spell it out. There is one interviewer and two interviewees, and there are 150 photos in an album lying open on the table. All three participants are looking at the photo album. Interviewee A says, "Look, that is so-and-so," and then interviewee B (because the other page is closer to them) says, "And this is so-and-so!" This happens for about 8 of the 12 facing photos. Because the interviewer is still typing the first name mentioned, they ask, "And when do you think that was?" But the metadata still comes in faster, as the second interviewee did not hear the question and the first one did but is still thinking. The bottom line is that more photos are viewed and commented on faster than can be recorded.

Something that could help this process would be to slow down (or moderate) the ability of the interviewee(s) to access the photos, in a way that synchronizes the processing times with the viewing times. Scanning the photos and then displaying them on a tablet slows down the viewing process and integrates the recording of data with the viewing of photos.

Positional Interaction Challenges

An interview is, at some level, an interaction. One question which comes up is: "How does the technology used affect that interaction?" What we found was that a laptop usually was situated between the interviewer and the interviewees. This positioned the parties in an opposing manner. Rather than the content becoming the central focus of both parties, the content was either in front of the interviewer or in front of the interviewees. A tablet changes this dynamic in the interaction. It brings both parties together over a single set of content, both positionally and cognitively. When the photo is displayed on the laptop, the laptop has to be rotated so that the interviewees can see the image and then turned so that the interviewer can input the data. This is not the case for a tablet.

Content Management Challenges

When paper is used for collecting metadata it is ideal to have one piece of paper for each photo. Sometimes this method is preferable to using a single computer. I used this method when I had a photo display and about 20 albums and about 200 people all filling out details at once.

People filling out metadata forms in front of a photo display.

People came and went as they pleased. When someone recognized someone or someplace they knew, they wrote down the picture ID and the info they were contributing along with their name. However, when carrying around photo albums and paper, there is the challenge of keeping all the photos from getting damaged, and of maintaining the order of the photos and associated papers.

Connectivity Challenges

When there is no internet there is no access to GoogleDocs. We encountered this when we went to someone's apartment, expecting internet because the internet is available on campus and this apartment was also on campus. Fortunately we did have a back-up plan, and pen and paper were used. But this meant that we then had to type out the data which was written down on the paper, in effect doing the same recording work twice.

Size of Devices

Photo albums have a certain bulk and cumbersome-ness which is multiplied when carrying more than one album at a time. Add to this a laptop computer and one might as well add to the list of required items a hand truck with which to carry everything. A tablet is, all in all, a lot smaller and lighter.

Laptop and Tablet. This image is credited to Alia Haley. ^[2]

Proof of Concept Technology

As I mentioned before, I had an iPad in my possession for a few days. So to capitalize on the opportunity, I bought a few apps from the app store, as I mentioned that I would, and tried them out.

Software which does not work for our purposes

Photoforge2
The first app I tried was Photoforge2. It is a highly rated app in the app store. I found that it delivered as promised. One could add or edit the IPTC and EXIF metadata. One could even edit where the photo was taken with a pin drop interface.

iPad Photoforge Location Data

iPad Photoforge Metadata Editor

Meta Editor
Meta Editor, another iPad app which was also highly acclaimed, performed the task almost as well. Photoforge2 had some photo editing features which were not needed in our project, whereas Meta Editor was focused only on metadata elements.

MetadataEditor Location Data

After using both applications it became apparent that neither would work for this project for at least two reasons:

Both applications edit the standards-based IPTC and EXIF metadata fields in photos. We have some custom metadata which does not fit into either of these sets of fields. One aspect of the technology being discussed, which might be helpful for readers to understand, is that these iPad applications actually embed the metadata into the photos. So when the photos are taken off of the iPad, the metadata travels with them. This is a desirable feature for presentation photos.
Even if we do embed the metadata with these apps, the version of the photo being enriched is not the archival version of the photo; it is the presentation version. We still need the data to become associated with the archival version of the photo.

Software with some really functional features

So we needed something with a mechanism for capturing our customized data. Two options were found which seemed suitable for the task. One was ideal, the other rapidly deployable. Understanding the iPad's place in the larger picture of corporate architecture, its relationship to the digital repository, and the process of data flow from the point of collection to dissemination will help us to visualize the particular challenges that the iPad presents solutions for. Once we see where the iPad sits in relationship to the rest of the digital landscape, I think it will be fairly obvious why one solution is ideal and the other rapidly deployable.

Placement of the iPad in the Information Architecture Picture

In my previous post on Social Metadata Collection ^[3] I used the image below to show where the iPad was used in the metadata collection process.

Meta-data Collection Model

Since that time, as I have shown this image when I talk about this idea, I have become aware that the image is not detailed enough. Because it is not detailed enough it can lead to some wrong assumptions on how the iPad use being proposed actually works. So, I am presenting a new image with a greater level of detail to show how the iPad interacts with other corporate systems and workflows.

iPad Team as they fit with other digital elements.

There are several things to note here:

Member Diaspora as represented here is not just members; it is their families, the people with whom these members worked, the members currently working, and the members living close at hand on campus, not just those in diaspora.
It is a copy of the presentation file which is pushed out to the iPad or the website for the Member Diaspora. This copy of the file does not necessarily need to be brought back to the archive as long as the metadata is synced back appropriately.
The Institutional Repository for other corporate items is currently in a DSpace instance. However, it has not been decided for sure that photos will be housed in this same instance, or even in DSpace.

That said, it is important that the metadata be embedded in the presentation file of the image, as well as accessible to the Main container for the archival of the photos. The metadata also needs to sync between the iPad application and the Member Diaspora website. Metadata truly needs to flow through the entire system.

FileMaker Pro with FileMaker Go

FileMaker Pro is a powerful database app. It could drive the Member Diaspora website and then also sync with the iPad. This would be a one-stop solution and therefore an ideal solution. It is also complex and takes more skill to set up than I currently have, or can currently spare the time to acquire. Both FileMaker Pro and its younger cousin Bento enable photos to be embedded in the actual database. This is important with regard to syncing with the iPad. To the best of my knowledge (and googling), no other database apps for the iPad or Android platforms allow for the syncing of photos within the app.

Several tips from the Bento forums on syncing photos which are part of the database:
Syncing pictures from Bento-Mac to Bento-iPad
Sync multiple photos or files from desktop to iPad

Bento
Bento is the rapidly deployable option. (See: What are the differences between Bento 4 for Mac, Bento for iPad 1.1.x, and Bento for iPhone/iPod touch 1.1.x?)
It took me about 2 hours (while doing other stuff) to download a trial version, find out how it worked, import my data from the GoogleDoc and then sync my database with the iPad.

Here is a YouTube video demonstrating my proof of concept using Bento.

http://youtu.be/_Eo5Ru0BF-k

Here is a series of iPad screenshots.

Screen Long ways

Inputting Data

Data and Photo seen together.

Some outstanding issues

Geo-location of photos in Bento. Bento version 4 does have location fields which can be used with a pin drop interface to add location data to the appropriate fields in the database. My proof of concept demo does not demonstrate this feature. Resources on using geo-location fields in Bento: Working with Location Fields in Bento; How to use Location fields in Bento for iPhone/iPad 1.1.1
Rapid reuse of data. Because the interview process naturally lends itself to eliciting the same kind of data over a multitude of photos, a UX/UI element which allows the rapid reuse of data would be very practical. The kinds of data which would lend themselves to rapid reuse would be people's names, locations, dates, photographer, etc. This may mean being able to query a table of already entered data values with an auto-suggest type function.
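To make the auto-suggest idea a little more concrete, here is a minimal sketch of how already entered values might be offered back to the interviewer. It is not a feature of Bento or of any app discussed here; the `ValueSuggester` class and the sample names are hypothetical, and it assumes suggestions are simply ranked by how often a value has been used.

```python
from collections import Counter


class ValueSuggester:
    """Remember values that were already entered and suggest them by prefix."""

    def __init__(self):
        self._counts = Counter()

    def record(self, value):
        """Store a value (a name, a place, etc.) each time it is entered."""
        value = value.strip()
        if value:
            self._counts[value] += 1

    def suggest(self, prefix, limit=5):
        """Return stored values starting with prefix, most used first."""
        prefix = prefix.strip().casefold()
        if not prefix:
            return []
        matches = [
            (value, count)
            for value, count in self._counts.items()
            if value.casefold().startswith(prefix)
        ]
        # Most frequently used first; ties are broken alphabetically.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [value for value, _ in matches[:limit]]


if __name__ == "__main__":
    names = ValueSuggester()
    # Made-up names entered during earlier photos in the same interview.
    for entered in ["Maria Lopez", "Mark Jones", "Maria Lopez", "Oaxaca"]:
        names.record(entered)
    print(names.suggest("ma"))
```

One suggester per field (people, places, photographers) would keep a place name from being offered where a person's name is expected.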

Custom iPad App

Of course there is also the option to develop a custom iPad app for just our purposes. This entails some other kinds of planning, including but not limited to:

Custom App development
Support plan
Deploy or develop possible Web-backend - if needed.

Kinds of custom metadata being collected.

The table in this section shows the kinds of questions we are asking in our interviews. It is only provided for reference, as a discussion of the Information Architecture for the storage and elements of the metadata schema is out of the scope of this discussion. The list of questions and values presented in the table was derived as a minimal set of questions based on issues of Image Workflow Processing, Intellectual Property and Permissions, Academic Merit and input from Controlled Vocabulary's Caption and Keywording Guidelines ^[4], which is part of their series on metalogging. The table also shows corresponding IPTC and EXIF data fields. (Though they are currently empty because I have not filled them in.) Understanding the relationships of XMP, IPTC, and EXIF also helps us to understand why and how the iPad tool needs to interact with other archiving solutions. However, it is not within the scope of this post to discuss these differences. Some useful resources on these issues are noted here:

Photolinker Metadata Tags ^[5] has a nice display outlining where XMP, IPTC and EXIF data overlap. This is not authoritative, but rather practical.
List of IPTC fields. However, a list is not enough; we also need to know what the fields mean so that we know that we are using them correctly.
EXIF and IPTC Header Comments. Here is another list of IPTC fields. This list also includes a list of EXIF fields. (Again without definitions.)
Various programs and applications also add their own metadata fields in the IPTC section. Here is a mapping of some of the most popular ones: http://www.controlledvocabulary.com/imagedatabases/iptc_core_mapped.pdf
IPTC Standard Photo Metadata ^[6]: http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf
Dublin Core with Photographs: http://makeit.digitalnz.org/askaquestion/questions/26
Dublin Core Metadata Element Set, Version 1.1: http://dublincore.org/documents/dces/
DCMI Type Vocabulary: http://dublincore.org/documents/dcmi-type-vocabulary/
Describing Digital Content: http://makeit.digitalnz.org/guidelines/describing-digital-content/

It is sufficient to note that there is some, and only some overlap.

Metadata Element	Purpose	Explanation
Photo Collection	This is the name of the collection in which the photos reside
Sub Collection	This is the name of the sub collection in which the photos reside
Letter of Collection	Each collection is given an alpha character or a series of alpha characters; if the collection pertains to one people group, then the alpha characters given to that collection are the three-letter ISO 639-3 code
Who input the Meta-data	This is the name of the person inputting the metadata
Photo Number	This is the number of the photo as we have inventoried the photo
Negative Number	This is the number of the photo as it appears on the negative (film strip)
Roll	This is the ID of the Roll	Most sets of negatives are cut into strips of 5 or fewer; this allows us to group these sets together to ID a “set” of photos
Section Number	If the items are in a book or a scrap book and that scrap book has a section, this is where that is recorded
Page#	If a scrap book has a set of pages then this is where they are recorded
Duplicates	This is where the Photo ID of a duplicate item is referenced.
Old Inventory Number(s)	This is the inventory number of an item if it were part of another inventory system
Photographer	This is the name of the photographer
Subject 1 (who)	Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 2		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 3		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 4		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 5		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
People group	This is the name of the people group mentioned in the ISO 639-3 codes
ISO 639-3 Code	This is the ISO 639-3 code of the people group being photographed
When was the photo Taken?	The date the photo was taken
Country	The country in which the photo was taken
District/City	This is the City where the photo was taken
Exact Place	The exact place name where the photo was taken
What is in the Photo (what)	This is an item in the photo
What is in the Photo	Additional what is in the photo
What is in the Photo	Additional what is in the photo
Why was the Photo Taken?		This is to help metadata providers think about how events get communicated
Description	This is a description of the photo’s contents	This is not a caption but could be used as a caption
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
I am in this photo and I approve it to be on the internet. Put in "yes" or "No" and write your name in the next column.	Permission to distribute
Name:	Name of the person releasing the photo
How was this photo digitized?	Method of digitization and the tools used in digitization
Who digitized This photo	This is the name of the person who did the digitization
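
Several rows in the table repeat (Subject 1 through 5, What is in the Photo, Who Provided This Meta-Data?) because a spreadsheet needs one column per value. As a rough sketch of how a custom app or database might hold the same information, the record below uses lists for the repeating elements, so that the "unlimited field" requirement is met. It covers only a subset of the table, and the class names, field names and sample values are illustrative assumptions, not an agreed schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Contribution:
    """Who provided a piece of metadata, and when (kept to judge authority)."""
    provided_by: str
    provided_on: str  # free-text date, as it would be typed in an interview


@dataclass
class PhotoRecord:
    """A subset of the table above, with repeating columns turned into lists."""
    photo_number: str
    photo_collection: str = ""
    photographer: str = ""
    country: str = ""
    description: str = ""
    subjects: List[str] = field(default_factory=list)          # Subject 1..n
    what_is_in_photo: List[str] = field(default_factory=list)  # What 1..n
    contributions: List[Contribution] = field(default_factory=list)

    def add_subject(self, name, provided_by, provided_on):
        """Add a person to the photo and note who supplied the name."""
        self.subjects.append(name)
        self.contributions.append(Contribution(provided_by, provided_on))


if __name__ == "__main__":
    # Made-up values for illustration only.
    record = PhotoRecord(photo_number="A-001", country="Mexico")
    record.add_subject("Maria Lopez", "Interviewee A", "13 September 2011")
    record.add_subject("Mark Jones", "Interviewee B", "13 September 2011")
    print(asdict(record))
```

Keeping the source of each contribution next to the value it supports is what lets the archive later judge the authority of a name or date.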

References
↑1	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]
↑2	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]
↑3	Hugh J. Paterson III. 29 June 2011. The Journeyler. [Accessed: 13 September 2011] https://hugh.thejourneyler.org/social-meta-data-collection. [Link]
↑4	Controlled Vocabulary. Caption and Keywording Guidelines. [Accessed: 13 September 2011] http://www.controlledvocabulary.com/metalogging/ck_guidelines.html. [Link]
↑5	Early Innovations, LLC. 2011. Photolinker Metadata Tags. [Accessed: 13 September 2011] http://www.earlyinnovations.com/photolinker/metadata-tags.html. [Link]
↑6	David Riecks. 2010. IPTC Standard Photo Metadata (July 2010). International Press Telecommunications Council. [Accessed: 13 September 2011] http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf [Link]

TIFFs, PDFs and OCR

Posted on August 16, 2011 by Hugh Paterson III

In the course of my experience I have been asked about PDFs and OCR several times. The questions usually follow the main two questions of this post.

So is OCR built into PDFs, or is there a need for independent OCR?

In particular an image based PDF, is it searchable?

The short answer is yes: Adobe Acrobat Pro has an OCR function built in. And to the second part: no, an image by itself is not searchable. But what can happen is that Adobe Acrobat Pro can perform an OCR function on an image, such as a .tiff file, and then add a layer of text (the output of the OCR process) behind the image. Then when the PDF is searched, it actually searches the text layer which is behind the image and tries to find the match. The OCR process is usually between 80-90% accurate on texts in English. This is usually good enough for finding words or partial words.

The Data Conversion Laboratory has a really nice and detailed write up on the process of converting from images to text with Adobe Acrobat Pro.

Daily Designer has a tutorial on how to do it on OS X.
David R. Mankin explains on his blog what the process looks like using Windows.

One of the beauties of Adobe Acrobat Pro is that this process can be scripted and the TIFFs processed in batches.
[On Windows] :: [On OS X using AppleScript] :: [Cross platform help from Adobe]

The University of Illinois at Chicago explains how to use Adobe Acrobat Pro and OCR with a scanner using a TWAIN driver.

The better OCR option

Since I work in an industry where we are dealing with multiple languages and the need to professionally OCR thousands of documents, I thought I would provide a few links on the comparison of OCR software on the market.
Lifehacker has a short write-up of the top five OCR tools.

Of those top five, two, ABBYY FineReader and Adobe Acrobat, are compared side by side on both OS X and Windows in this article.

Are all files used to create an original PDF included in the PDF?

One thing to remember, which I have said before, is that not all PDFs are created equal. This manual talks a bit about different settings inside of PDFs when using Adobe’s PDF printer.

The short answer is no. But the long answer is yes. Depending on the settings of the PDF creator, the original files might be altered before they are wrapped in a PDF wrapper.

So the objection, usually in the form of a question, sometimes comes up:

Is the PDF file just using the PDF framework as a wrapper around the original content? Therefore, to archive things “properly” do I still need to keep the .tiff images if they are included in the PDF document?

The answer is: “it depends”. It depends on several things, one of which is, what program created the PDF and how it created the PDF. – Did it send the document through PostScript first? Another thing that it depends on is what else might one want to do with the .tiff files?

In an archiving mentality, the real question is: “Should the .tiff files also be saved?” The best practice answer is yes. The reason is that the PDF is viewed as a presentation version and the .tiff files are viewed as the digital “originals”.

Metadata Magic

Posted on August 10, 2011 by Hugh Paterson III

The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachement of the digital item to the proper bitstream. Totally Awesome.

However, one of the challenges is that just because the metadata is curated, collected and properly filed, it does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft-Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time), but later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004; see James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards.

So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata into the files before upload, so that when the files arrive in DSpace they will have the correct metadata.

The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so, as they are both brainchildren of the geeks at Adobe.) However, what is needed in this particular setup is a two-way street with a check in between: read what is embedded in the file, compare it with the curated metadata, and overwrite what was there with the data we want there.
However, as of 2009, there were no tools in AIR which could manipulate Exif data (for photos).
But it does look like the situation is more hopeful for working with audio metadata.
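
The "check in between" could be as simple as comparing two sets of fields. Here is a minimal sketch in Python (not AIR), assuming the embedded metadata and the curated metadata have each already been read into a dictionary. The function name and the field names are hypothetical and only there for illustration.

```python
def fields_to_overwrite(embedded, curated):
    """Return the curated values that are missing from, or differ from,
    the metadata currently embedded in the file."""
    return {
        key: value
        for key, value in curated.items()
        if embedded.get(key) != value
    }


# Hypothetical example: the file still carries the author set by the
# program that generated the PDF, and has no subject at all.
embedded = {"Author": "Microsoft-Word", "Title": "A Grammar"}
curated = {"Author": "Jane Doe", "Title": "A Grammar", "Subject": "Linguistics"}

print(fields_to_overwrite(embedded, curated))
```

Only the fields returned by the check would then need to be written back into the file.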

One way around the limitations of JavaScript itself might be to use JavaScript to call a command-line tool, execute a Python, Perl, or shell script, or even use a library. There are some technical challenges which need to be bridged when using these kinds of tools in a cross-platform environment (anything from flavors of Linux to OS X 10.4-10.7 and Windows XP through current). This is mostly because of the various ways of implementing scripts on different platforms.

The technical challenge is that Adobe AIR is basically a JavaScript environment. As such, there are certain technical challenges around the implementation of command-line tools like Xpdf from fooLabs and Coherent PDF Tools, or Phil Harvey’s ExifTool, Exifv2, pdftk, or even TagLib. One of the things that Adobe AIR can do is call an executable via something called ActionScript. There are even examples of how to do this with PDF metadata. This method uses PurePDF, a complete ActionScript PDF library. ActionScript is powerful in and of itself: it can be used to read the XMP metadata of a PDF, though one could also use it to call on Java to do the same “work”.
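
To make the idea of calling a command-line tool from a script a bit more concrete, here is a rough sketch in Python (not ActionScript, and not what the AIR app actually does). It assumes ExifTool is installed and on the PATH, and it uses ExifTool's `-TAG=VALUE` writing syntax and its `-overwrite_original` option. The function names, the file name and the tag values are hypothetical. Passing the arguments as a list, rather than as one shell string, sidesteps some of the quoting differences between platforms.

```python
import shutil
import subprocess


def build_exiftool_command(path, tags, executable="exiftool"):
    """Return an argument list asking ExifTool to write `tags` into `path`."""
    command = [executable, "-overwrite_original"]
    # Sort the tags so the command is the same on every run.
    for name, value in sorted(tags.items()):
        command.append(f"-{name}={value}")
    command.append(path)
    return command


def embed_metadata(path, tags):
    """Run ExifTool if it can be found; return True only if it succeeded."""
    executable = shutil.which("exiftool")
    if executable is None:
        return False  # Assumption: a missing tool is reported, not raised.
    result = subprocess.run(
        build_exiftool_command(path, tags, executable),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Hypothetical usage: show the command without running it.
print(build_exiftool_command("paper.pdf", {"Title": "A Grammar", "Author": "Jane Doe"}))
```

The same pattern would apply to the other tools mentioned above, but each has its own options, so the argument list would need to be built differently for each one.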

Three Lingering Thoughts

Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, it does not mean that the submitters would know how to use that ability or why to use it. This is not, however, a valid reason to leave the functionality out of a development project. All marketing aside, an archive does have a responsibility to consumers of the digital content that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers do appreciate, even expect, that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, Photoshop, iTunes, Mendeley, Papers, etc.
Another thought that comes to mind is that when one is dealing with large files (over 1 GB), there is a reason for making a “preview” version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see if it is worth downloading the .wav file? It seems that a metadata packager could also create a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
This problem has been addressed in the open source library science world. In fact a nice piece of software does live out there. It is called the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs but it is a solution for some needs of this type.
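
Going back to the preview idea in the second thought above, here is a small sketch in Python of how a packager might build an ImageMagick call for an image preview. It assumes ImageMagick is installed and that its executable is named `convert` (ImageMagick 7 installs call it `magick`). The function names, the `_preview` suffix, the 1024 pixel limit and the quality value are my own choices for illustration. The `>` at the end of the geometry tells ImageMagick to shrink only images larger than the limit.

```python
from pathlib import Path


def preview_path(source, suffix=".jpg"):
    """Return a sibling file name for the preview, e.g. scan_preview.jpg."""
    source = Path(source)
    return source.with_name(source.stem + "_preview" + suffix)


def build_preview_command(source, max_pixels=1024, executable="convert"):
    """Return an argument list asking ImageMagick for a small JPEG preview."""
    geometry = f"{max_pixels}x{max_pixels}>"  # shrink only, never enlarge
    return [
        executable,
        str(source),
        "-resize", geometry,
        "-quality", "80",
        str(preview_path(source)),
    ]


# Hypothetical usage: show the command without running it.
print(build_preview_command("page_001.tif"))
```

The argument list is meant to be handed to something like `subprocess.run` without a shell, so the `>` is passed to ImageMagick and is not treated as a redirect.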

Photos and Metadata

Posted on July 19, 2011 by Hugh Paterson III

There are several things to think about with respect to photos and metadata.

1. The What: these are the elements of the metadata's data. The "Who", "What", "Where", "When", "Why" and "How" of the photo.

2. The How: The technical storage of the metadata. Where is all of this data stored?

These two issues are discussed in their own child pages. There is a lot to say on each one.

In brief, the What tries to answer the question: What kind of metadata should or can be collected?

Whereas the How tries to answer the question: What should or can be done with this metadata?

File Naming convention: http://www.controlledvocabulary.com/imagedatabases/filenaming.html

Keywords: http://www.controlledvocabulary.com/metalogging/ck_guidelines.htm

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Category Archives: Images