Citations, Names and Language Documentation

Posted on September 30, 2011 by Hugh Paterson III

I have recently been reading the blog of Martin Fenner and came upon the article Personal names around the world ^[1] Martin Fenner. 14 August 2011. Personal names around the world. PLoS Blog Network. http://blogs.plos.org/mfenner/2011/08/14/personal-names-around-the-world . [Accessed: 16 September 2011]. [Link] . His post is in fact a reflection on a W3C paper on Personal Names around the WorldSeveral other reflections are here: http://www.w3.org/International/wiki/Personal_names (same title). This is apparently coming out of the i18n effort and is an effort to help authors and database designers make informed decisions about names on the web.
I read Martin’s post with some interest because in Language Documentation getting someone’s name as a source or for informed consent is very important (from a U.S. context). Working in a archive dealing with language materials, I see lot of names. One of the interesting situations which came to me from an Ecuadorian context was different from what I have seen in the w3.org paper or in the w3.org discussion. The naming convention went like this:

The elder was known by the younger’s name plus a relationship.

My suspicion is that it is a taboo to name the dead. So to avoid possibly naming the dead, the younger was referenced and the the relationship was invoked. This affected me in the archive as I am supposed to note who the speaker is on the recordings. In lue of the speakers name, I have the young son’s first name, who is well known in the community, and is in his 30’s or so, and I have the relationship. So in English this might sound like John’s mother. Now what am I supposed to put in the metadata record for the audio recordings I am cataloging? I do not have a name but I do have a relationship to a known (to the community) person.

I inquired with a literacy consultant who has worked in Ecuador with indigenous people for some years, she informed me that in one context she was working in everyone knew what family line they were from and all the names were derived from that family line by position. It was of such that to call someone by there name was an insult.

It sort of reminds me of this sketch by Fry and Laurie.

References[+]

References
↑1	Martin Fenner. 14 August 2011. Personal names around the world. PLoS Blog Network. http://blogs.plos.org/mfenner/2011/08/14/personal-names-around-the-world . [Accessed: 16 September 2011]. [Link]

JetPack, Menus, and Others

Posted on September 25, 2011 by Hugh Paterson III

Sharing

Jetpack is in no way new… But I have never installed it (it seems that half a million other people have though). The only service I have used from Automattic is akismet. Then about a month ago I installed after the dead line as a Google Chrome plugin to help me with my spelling mistakes. It seemed to work so I thought I would give it a go as a WordPress plugin.

What was new was that I had not integrated a sharing solution for readers of my blog. So as of now there is a share this option at the end of my posts.

Sharing options

Of course Sharedaddy, the sharing plugin did not have a Google +1 sharing option, nor del.ic.ious sharing option. So I had to find some solutions. I found a fork of Sharedaddy on github which had added Google+ and LinkedIn. (I am not on Google+ but I just joined LinkedIn last week as I was redoing my resume).
To add delicious I followed a post by Ryan Markel to find the right share service URLs.

Menus

The other thing I figured out this week was how to use the Menus Feature under the Appearance tab. I have been using K2 since 2005 and have always thought that the menus in the default theme were sufficient. I have usually not had complex menu desires. So there was no real need to learn these new features, however. Now I wanted to put several picture pages under the same menu. So wal-la. It is done now.

New menu settings

Others

(Mostly RDFa and HTML5)
I also have a plugin that is adding Open Graph RDFa tags to my theme. My current version of K2 is HTML5 but, it is not validating with the RDFa tags in it. So I was trying to validate them but have not been successful. I looked at this answer which said to add something to the doctype. But then there is more answers too. Sometimes these answers are beyond me. I which I had some structured learning in this subject area.

Why RDF?

And RDFa is the basis of Open Graph, the technology used to sync FaceBook Likes between my site and FaceBook.

The iPad Team

Posted on September 16, 2011 by Hugh Paterson III

The Concept

I have had some ideas I wanted to try out for using the iPad as a tool for collecting photo metadata. Working in a corporate archive, I have become aware of several collections of photos without much metadata associated with them.

The photos are the property of (or are in the custodial care of) the company I work at (in their corporate archive).

The subject of the photos are either of two general areas:

The minority language speaking people groups that the employees of a particular company worked with, including anthropological topics like ways of life, etc.
Photos of operational events significant to telling the story of the company holding the photos.

Archives in more modern contexts are trying to show their relevance to not only academics, but also to general members of communities. In the United States there is a whole movement of social history. There are community preservation societies which take on the task of collecting old photographs and their stories and preserving, and presenting them for future generations.

The challenge at hand is: "How do we enrich photos by adding metadata to photos in the collections of archives?" There are many solutions to this kind of task. The refining, distilling, and reduction of stories and memories to writing and even to metadata fields is no easy task, nor is it a task that one person can do on their own. One solution, which is often employed by community historians is the personal interview. By interviewing the photographers or people who were at an event and asking them questions about a series of photos it presents an atmosphere of inquisitiveness and one where the story-teller is valued because they have a story-listener. This basic personal connection allows for interactions to occur across generational and technological barriers.

The crucial question is: "How do we facilitate an interaction which is positive for all the parties involved?" The effort and thinking behind answering this question has more to do with shaping human interactions than with anything else. We are also talking about using technology in this interaction. This is true UX or (User Experience).

Past Experience

This past summer I have had several experiences with facilitating one-on-one interactions between knowledgeable parties working with photographs and with someone acting on behalf of the corporate archive. To facilitate this interaction a GoogleDoc Spreadsheet was set up and the person acting on the behalf of the archive was granted access to the spreadsheet. The individual conducting the interview and listening to the stories brought their own netbook (small laptop) from which to enter any collected data. They were also given a photo album full of photos, which the interviewee would look through. This set-up required overcoming several local environmental challenges. As discussed below, some of these challenges were better addressed than others.

Association of Data to a Given Photo

The challenge of keeping up to 150 photos organized durring an interview so that metadata about any given photo could be collected and associated with only that photo. This was addressed by adhering an inventory sticker to the back of each photo and assigning each photo a single row in the GoogleDoc Spreadsheet. Using GoogleDocs was not the ideal solution, but rather than a solution of some compromises:

Strengths of GoogleDocs

One of the great things about GoogleDocs is that the capability exists for multiple people to edit the spreadsheet simultaneously.
Another strength of GoogleDocs is that there is a side bar chat feature so that if there is a question durring the interview that help could be had very quickly from management (me, who was offsite).
The Data can be exported in the following formats: .xlsx , .xls , .csv , .pdf.
There was no cost to deploy the technology.
It is accessible through a web-browser in an OS neutral manner.
The document is available wherever the internet is available.
A single solution could be deployed and used by people digitizing photos, recording written metadata on the photos, and gathering metadata during an interview.
Most people acting on behalf of the archive were familiar with the technology.

Pitfalls of GoogleDocs

More columns exist in the spread sheet than can be practically managed (The columns are presented below in a table). There are about 48 values in a record and there are about 40,000 records.

More columns than can be practically managed

Does not display the various levels of data as levels of data as levels in the user interface.
Cannot remove unnecessary fields from the UI of various people. (No role-based support.)
Only available when there is internet.

Maximizing of Interview Time

To maximize time spent with the interviewee the photos and any metadata written or known about a photo was put into the GoogleDoc Spreadsheet prior to the interview. Sometimes this was not done by the interviewer but rather by someone else working on behalf of the archive. Durring the interview the interviewer could tell which data fields were empty by looking for the gray cells in the spreadsheet. However, just because the cells were did not mean that the interviewee was more prone to provide the desired, unknown, information.

Grey Areas Show Metadata fields which are empty

Data Input Challenges

One unanticipated challenge which was encountered in the interviews was that as the interviewer would bring out an album or two of photos that the interviewees would be able to cover more photos than the interviewer could record.

Let me spell it out. There is one interviewer and two interviewees there are 150 photos in an album lying open on the table. All three participants are looking at the photo album. The interviewee A says look that is so-and-so and then interviewee B (because the other page is closer to them) says and this is so-and-so! This happens for about 8 of the 12 facing photos. Because the interviewer is still typing the first name mentioned they ask and when do you think that was? But the metadata still comes in faster, as the second interviewee did not hear the question and the first one did but still thinking. The bottom line is that more photos are viewed and commented on faster than can be recorded.

Something that could help this process would be to in some way to slow-down (or moderate) the ability of the interviewee(s) to access the photos. Something that could synchronize the processing times with the viewing times. By scanning the photos and then displaying them on a tablet it slows down the viewing process and integrates the recording of data with the viewing of photos.

Positional Interaction Challenges

An interview is, at some level, an interaction. One question which comes up is How does the technology used affect that interaction? What we found was that a laptop usually was situated between the interviewer and the interviewees. This positioned the parties in an apposing manner. Rather than the content becoming the central focus of both parties, the content was either in front of the interviewer or in front of the interviewees. A tablet changes this dynamic in the interaction. It brings both parties together over a single set of content, both positionally and cognitively. When the photo is displayed on the laptop, the laptop has to be rotated so that the interviewees can see the image and then turned so that the interviewer can input the data. This is not the case for a tablet.

Content Management Challenges

When Paper is used for collecting metadata it is ideal to have one piece of paper for each photo. Sometimes this method is preferable to using a single computer. I used this method when I had a photo display and about 20 albums and about 200 people all filling out details at once.

People filling out metadata forms infront of a photo display.

People came and went as they pleased. When someone recognized someone or someplace they knew, they wrote down the picture ID and the info they were contributing along with their name. However, carrying around photo albums and paper there is the challenge of keeping all the photos from getting damaged, and maintaining the order of the photos and associated papers.

Connectivity Challenges

When there is no internet there is no access to GoogleDocs. We encountered this when we went to someone's apartment, expecting interent because the interent is available on campus and this apartment was also on campus. Fortunately we did have a back up plan and paper pen was used. But this means that we now had to type out the data, which was written down on the paper; in effect doing the same recording work twice.

Size of Devices

Photo albums have a certain bulk and cumbersome-ness which is multiplied when carrying more than one album at a time. Add to this a computer laptop and one might as well add to the list of required items, a hand truck with which to carry everything. A tablet is all in all a lot smaller and lighter.

</ref></p> " data-medium-file="https://i0.wp.com/hugh.thejourneyler.org/wordpress/wp-content/uploads/2011/09/laptop-vs-tablet-620x410.jpg?fit=300%2C198&ssl=1" data-large-file="https://i0.wp.com/hugh.thejourneyler.org/wordpress/wp-content/uploads/2011/09/laptop-vs-tablet-620x410.jpg?fit=584%2C386&ssl=1" src="https://hugh.thejourneyler.org/wordpress/wp-content/uploads/2011/09/laptop-vs-tablet-620x410.jpg" alt="Laptop and Tablet" width="620" height="410" class="size-full wp-image-2804" srcset="https://i0.wp.com/hugh.thejourneyler.org/wordpress/wp-content/uploads/2011/09/laptop-vs-tablet-620x410.jpg?resize=620%2C410&ssl=1 620w, https://i0.wp.com/hugh.thejourneyler.org/wordpress/wp-content/uploads/2011/09/laptop-vs-tablet-620x410.jpg?resize=300%2C198&ssl=1 300w" sizes="auto, (max-width: 584px) 100vw, 584px" />

Laptop and TabletThis image is credited to Alia Haley ^[2] Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]

Proof of Concept Technology

As I mentioned before, I had an iPad in my possession for a few days. So to capitalize on the opportunity, I bought a few apps from the app store, as I mentioned that I would and tried them out.

Software which does not work for our purposes

Photoforge2
The first app I tried was Photoforge2. It is a highly rated app in the app store. I found that it delivered as promised. One could add or edit the IPTC and EXIF metadata. One could even edit where the photo was taken with a pin drop interface.

iPad Fotoforge Location Data

iPad Fotoforge Metadata Editor

Meta Editor
Meta Editor, another iPad app, which was also highly acclaimed performed task almost as well. Photoforge2 had some photo editing features which were not needed in our project. Whereas Meta Editor was focused only on metadata elements.

MetadataEditor Location Data

After using both applications it became apparent that neither would work for this project for at least two reasons:

Both applications edit the Standards based IPTC and EXIF metadata fields in photos. We have some custom metadata which does not fit into either of these fileds.One aspect of the technology being discussed, which might be helpful for readers to understand, is that these iPad applications actually embed the metadata into the photos. So when the photos are then taken off of the iPad the metadata travels with them. This is a desirable feature for presentation photos.
Even if we do embed the metadata with these apps the version of the photo being enriched is not the Archival version of the photo it is the Presentation version of the photo. We still need the data to become associated with the archival version of the photo.

Software with some really functional features

So we needed something with a mechanism for capturing our customized data. Two options were found which seemed to avail themselves as suitable for the task. One was ideal the other rapidly deployable. Understanding the iPads' place in the larger place of corporate architecture, relationship to the digital repository, the process of data flow from the point of collection to dissemination, will help us to visualize the particular challenges that the iPad presents solutions for. Once we see where the iPad sits in relationship to the rest of the digital landscape I think it will be fairly obvious why one solution is ideal and the other rapidly deployable.

Placement in the iPad in the Information Architecture Picture

In my previous post on Social Metadata Collection ^[3]Hugh J. Paterson III. 29 June 2011. The Journeyler. [Accessed: 13 September 2011] https://hugh.thejourneyler.org/social-meta-data-collection. [Link] I used the below image to show where the iPad was used in the metadata collection process.

Meta-data Collection Model

Since that time, as I have shown this image when I talk about this idea, I have become aware that the image is not detailed enough. Because it is not detailed enough it can lead to some wrong assumptions on how the iPad use being proposed actually works. So, I am presenting a new image with a greater level of detail to show how the iPad interacts with other corporate systems and workflows.

iPad Team as they fit with other digital elements.

There are several things to note here:

Member Disporia as represented here is not just members, it is their families, the people with whom these members worked, it is the members currently working and it the members living close at hand on campus, not just in disporia.
It is a copy of the presentation file which is pushed out to the iPad or the website for the Member Disporia. This copy of the file does not necessarily need to be brought back to the archive as long as the metadata is synced back appropriately.
The Institutional Repository for other corporate items is currently in a DSpace instance. However, it has not been decided for sure that photos will be housed in this same instance, or even in DSpace.

That said, it is important that the metadata be embedded in the presentation file of the image, as well as accessible to the Main container for the archival of the photos. The metadata also needs to sync between the iPad application and the Member Diaspora website. Metadata truly needs to flow through the entire system.

FileMaker Pro with File Maker Go

FileMaker Pro is a powerful database app. It could drive the Member Disporia website and then also sync with the iPad. This would be a one-stop solution and therefore and ideal solution. It is also complex and takes more skill to set up than I currently have, or I can currently spare to acquire. Both FileMaker Pro and its younger cousin Bento enable Photos to be embedded in the actual database.Several tips from the Bento forums on syncing photos which are part of the database:
Syncing pictures from Bento-Mac to Bento-iPad
Sync multiple photos or files from desktop to IPad
This is something which is important with regards to syncing with the iPad. To the best of my knowledge (and googling) no other database apps for the iPad or Android platforms allow for the syncing of photos within the app.

Bento
Bento is the rapidly deployable option.What are the differences between Bento 4 for Mac, Bento for iPad 1.1.x, and Bento for iPhone/iPod touch 1.1.x?
It took me about 2 hours (while doing other stuff) to download a trial version, find out how it worked, import my data from the GoogleDoc and then sync my database with the iPad.

Here is a YouTube video demonstrating my proof of concept using Bento.

httpv://youtu.be/_Eo5Ru0BF-k

Here is a series of iPad Screen shots.

Screen Long ways

Inputing Data

Data and Photo seen together.

Some outstanding issues

Geo-location of Photos in Bento. Bento version 4 does have location fileds which can be used with a pin drop interface to add location data to the appropiate fileds in the database. My proof of concept demo does not demonstrate this feature.Using Geo-location fields in Bento: Working with Location Fields in Bento
How to use Location fields in Bento for iPhone/iPad 1.1.1
Rapid reuse of data. Because the interview process naturally lends itself to eliciting the same kind of data over a multitude of photos a UX/UI element which allows the rapid reuse of data would be very practical. The kinds of data which would lend themselves to rapid reuse would be peoples' names, locations, dates, photographer, etc. This may mean being able to query a table of already input'd data values with an auto-suggest type function.

Custom iPad App

Of course there is also the option to develop a custom iPad app for just our purposes. This entails some other kinds of planning, including but not limited to:

Custom App development
Support plan
Deploy or develop possible Web-backend - if needed.

Kinds of custom metadata being collected.

The table in this section shows the kinds of questions we are asking in our interviews. It is not only provided for reference as a discussion of the Information Architecture for the storage and elements of the metadata schema is out of the scope of this discussion. The list of questions and values presented in the table was derived as a minimal set of questions based on issues of Image Workflow Processing, Intelectual Property and Permissions, Academic Merit and input from the controlled vocabulary's Caption and Keywording Guidelines ^[4] Controlled Vocabulary. Caption and Keywording Guidelines. [Accessed: 13 September 2011] http://www.controlledvocabulary.com/metalogging/ck_guidelines.html. [Link] which is part of their series on metalogging. The table also shows corresponding IPTC, and EXIF data fields. (Though they are currently empty because I have not filed them in.) Understanding the relationships of XMP, IPTC, and EXIF also help us to understand why and how the iPad tool needs to interact with other Archiving solutions. However, it is not within the scope of this post to discuss these differences.Some useful resources on these issues are noted here:

Photolinker Metadata Tags ^[5] Early Innovations, LLC. 2011. Photolinker Metadata Tags. [Accessed: 13 September 2011] http://www.earlyinnovations.com/photolinker/metadata-tags.html. [Link] has a nice display outlining where XMP, IPTC and EXIF data overlap. This is not authoritative, but rather practical.
List of IPTC fields: List of IPTC fields. However, a list is not enough we also need to know what they mean so that we know that we are using them correctly.
EXIF and IPTC Header Comments. Here is another list of IPTC fileds. This list also includes a list of list of EXIF fileds. (Again without definitions.)
Various programs and applications also add their own metadata fields in the IPTC section. Here is a mapping of some of the most popular ones: http://www.controlledvocabulary.com/imagedatabases/iptc_core_mapped.pdf
IPTC Standard Photo Metadata ^[6]David Riecks. 2010. IPTC Standard Photo Metadata (July 2010). International Press Telecommunications Council. [Accessed: 13 September 2011] … Continue reading http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf
Doublin Core with Photographs: http://makeit.digitalnz.org/askaquestion/questions/26
Dublin Core Metadata Element Set, Version 1.1: http://dublincore.org/documents/dces/
DCMI Type Vocabulary: http://dublincore.org/documents/dcmi-type-vocabulary/
Describing Digital Content: http://makeit.digitalnz.org/guidelines/describing-digital-content/

It is sufficient to note that there is some, and only some overlap.

Metadata Element	Purpose	Explanation
Photo Collection	This is the name of the collection in which the photos reside
Sub Collection	This is the name of the sub collection in which the photos reside
Letter of Collection	Each collection is given an alpha character or a series of alpha characters, if the collection pertains to one people group then the alpha characters given to that collection are the three digit ISO 639-3 code
Who input the Meta-data	This is the name of the person inputting the metadata
Photo Number	This is the number of the photo as we have inventoried the photo
Negative Number	This is the number of the photo as it appears on the negative (film strip)
Roll	This is the ID of the Roll	Most sets of negatives are cut into strips of 5 or less this allows us to group these sets together to ID a “set” of photos
Section Number	If the items are in a book or a scrap book and that scrap book has a section this is where that is recoreded
Page#	If a scrap book has a set of pages then this is where they are recoreded
Duplicates	This is where the Photo ID of a duplicate item is referenced.
Old Inventory Number(s)	This is the inventory number of an item if it were part of another invenotry system
Photographer	This is the name of the photographer
Subject 1 (who)	Who is in the photo, this should be an unlimited field. That is sveral names should be able to be added to this.
Subject 2		Who is in the photo, this should be an unlimited field. That is sveral names should be able to be added to this.
Subject 3		Who is in the photo, this should be an unlimited field. That is sveral names should be able to be added to this.
Subject 4		Who is in the photo, this should be an unlimited field. That is sveral names should be able to be added to this.
Subject 5		Who is in the photo, this should be an unlimited field. That is sveral names should be able to be added to this.
People group	This is the name of the people group meneined in the ISO 639-3 codes
ISO 639-3 Code	This is the ISO 639-3 code of the people group being photographed
When was the photo Taken?	The date the photo was taken
Country	The country in which the photo was taken
District/City	This is the City where the photo was taken
Exact Place	The exact place name where the photo was taken
What is in the Photo (what)	This is an item in the photo
What is in the Photo	Additional what is in the photo
What is in the Photo	Addtional what is in the photo
Why was the Photo Taken?		This is to help metadata providers think about how events get communicated
Description	This is a description of the photo’s contents	This is not a caption but could be used as a caption
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
I am in this photo and I approve it to be on the internet. Put in "yes" or "No" and write your name in the next column.	Permission to distribute
Name:	Name of the person releasing the photo
How was this photo digitized?	Method of digitization and the tools used in digitization
Who digitized This photo	This is the name of the person who did the digitization

References[+]

References
↑1	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [<a href="http://churchm.ag/tablet-vs-laptop" title="Tablet vs Laptop">Link</a>]
↑2	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]
↑3	Hugh J. Paterson III. 29 June 2011. The Journeyler. [Accessed: 13 September 2011] https://hugh.thejourneyler.org/social-meta-data-collection. [Link]
↑4	Controlled Vocabulary. Caption and Keywording Guidelines. [Accessed: 13 September 2011] http://www.controlledvocabulary.com/metalogging/ck_guidelines.html. [Link]
↑5	Early Innovations, LLC. 2011. Photolinker Metadata Tags. [Accessed: 13 September 2011] http://www.earlyinnovations.com/photolinker/metadata-tags.html. [Link]
↑6	David Riecks. 2010. IPTC Standard Photo Metadata (July 2010). International Press Telecommunications Council. [Accessed: 13 September 2011] http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf [Link]

Language Documentation and the Datum

Posted on September 3, 2011 by Hugh Paterson III

The importance of knowing about the Datum ^[1]Wikipedia contributors. Datum (geodesy). Wikipedia, The Free Encyclopedia. 3 April 2011, 00:28 UTC. Available at: http://en.wikipedia.org/w/index.php?title=Datum_(geodesy)&oldid=422063702. [Accessed … Continue reading recently came to my attention as I was working with GIS data on a Language Documentation project. We were collecting GPS coordinates with a handheld GPS unit and comparing these coordinates with data supplied by the national cartographic office. End goal was to compare data samples collected with conclusions proposed by the national cartographic office.

So, what am I talking about?

GIS data is used in a Geographical Information System. Basically, you can think of maps and what you might want to show with a map: rivers, towns, roads, language features, dialect markers, etc. Well, maps are shapes superimposed with a grid. And coordinates are a way of naming where on a particular grid a given point is located.

Continue reading →

References[+]

References
↑1	Wikipedia contributors. Datum (geodesy). Wikipedia, The Free Encyclopedia. 3 April 2011, 00:28 UTC. Available at: http://en.wikipedia.org/w/index.php?title=Datum_(geodesy)&oldid=422063702. [Accessed 5 May 2011] [Link]

Letting Go

Posted on August 29, 2011 by Hugh Paterson III

Working in an archive, one can imagine that letting go of materials is a real challenge. Both in that it is hard to do becasue of policy, but also because it is hard to do because of the emotional “pack-rat” nature of archivist. This is no less the case of the archive where I work. We were recently working through a set of items and getting rid of the duplicates. (Physical space has its price; and the work should soon be available via JASOR.) However, one of the items we were getting rid of was a journal issue on a people group/language. The journal has three articles, of these, only one of them article was written by someone who worked for the same organization I am working for now. So the “employer” and owner-operator of the archive only has rights to one of the three works. (Rights by virtue of “work-for-hire” laws.) We have the the off-print, which is what we have rights to share, so we keep and share that. It all makes sense. However, what we keep is catalogued and inventoried. Our catalogue is shared with the world via OLAC. With this tool someone can search for a resource on a language, by language. It occurs to me that the other two articles on this people group/language will not show in the aggregation of results of OLAC. This is a shame as it would be really helpful in many ways. I wish there was a groundswell, open source, grassroots web facilitated effort where various researchers can go and put metadata (citations) of articles and then they would be added to the OLAC search.

TIFFs, PDFs and OCR

Posted on August 16, 2011 by Hugh Paterson III

In the course of my experience I have been asked about PDFs and OCR several times. The questions usually follow the main two questions of this post.

So is OCR built into PDFs? or is there a need for independent OCR?

In particular an image based PDF, is it searchable?

The Short answer is Yes. Adobe Acrobat Pro has an OCR function built in. And to the second part: No, an image is not searchable. But what can happen is that Adobe Acrobat Pro can perform an OCR function to an image such as a .tiff file and then add a layer of text, (the out put of the OCR process) behind the image. Then when the PDF is searched it actually searches the text layer which is behind the image and tries to find the match. The OCR process is usually between 80-90% accurate on texts in english. This is usually good enough for finding words or partial words.

The Data Conversion Laboratory has a really nice and detailed write up on the process of converting from images to text with Adobe Acrobat Pro.

Daily Designer has a tutorial on how to do it on OS X.
David R. Mankin explains on his blog what the process looks like using Windows.

One of the beauties of Adobe Acrobat Pro is that this process can be scripted and the TIFFs processed in batches.
[On Windows] :: [On OS X using AppleScript] :: [Cross platform help from Adobe]

University Illinois Chicago explains how to do use Adobe Acrobat Pro and OCR with a scanner using a TWAIN driver.

The better OCR option

Since I work in an industry where we are dealing with multiple languages and the need to professionally OCR thousands of documents I thought I would provide a few links on the comparison of OCR software on the market.
Lifehacker has short write up of the top five OCR tools.

Of those top 5, in this article, two, ABBYY Fine Reader and Adobe Acrobat are compared side by side on both OS X and Windows.

Are all files used to create an orignal PDF included in the PDF?

One thing to remember, Which I have said before, is that not all PDFs are created equal. This Manual talks a bit about different settings inside of PDFs when using Adobe’s PDF printer.

The Short answer is No. But the long answer is Yes. Depending on the settings of the PDF creator the original files might be altered before they are wrapped in a PDF wrapper.

So the objection, usually in the form of a question sometimes comes up:

Is the PDF file just using the PDF framework as a wrapper around the original content? Therefore, to archive things “properly” do I still need to keep the .tiff images if they are included in the PDF document?

The answer is: “it depends”. It depends on several things, one of which is, what program created the PDF and how it created the PDF. – Did it send the document through PostScript first? Another thing that it depends on is what else might one want to do with the .tiff files?

In an archiving mentality, the real question is: “Should the .tiff files also be saved?” The best practice answer is Yes. The reason is that the PDF is viewed as a presentation version and the .tiff files are views as the digital “originals”.

Metadata Magic

Posted on August 10, 2011 by Hugh Paterson III

The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachement of the digital item to the proper bitstream. Totally Awesome.

However, one of the challenges is that just because the metadata is curated, collected and properly filed, it does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft-WordMore about the metadata attributes of PDF/A can be read about on pdfa.org. Not only is the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time) later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004 by James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards.

So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata in the files before upload so that when the files arrive in DSpace that they will have the correct metadata.

The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so as they are both brain children of the geeks at Adobe.) However, what is needed in this particular set up is a two way street with a check in between. We would need to overwrite what was there with the data we want there.
However, as of 2009, there were no tools in AIR which could manipulate exif Data (for photos).
But it does look like the situation is more hopeful for working with audio metadata.

One way around the limitations of JavaScript itself might be to use JavaScript to call a command-line tool or execute a python, perl, or shell script, or even use a library. There are some technical challenges which need bridged when using these kinds of tools in a cross-platform environment. (Anything from flavors of Linux to, OS X 10.4-10.7 and Windows XP – Current.) This is mostly because of the various ways of implementing scripts on differnt platforms.

The technical challenge is that Adobe AIR is basically a JavaScript environment. As such there are certain technical challenges around implementation of command-line tools like Xpdf from fooLabs and Coherent PDF Tools or Phil Harvey’s ExifTool, Exifv2, pdftk, or even TagLib. One of the things that Adobe AIR can do is call an executable via something called actionscript. There are even examples of how to do this with PDF Metadata. This method uses PurePDF, a complete actionscript PDF library. Actionscript is powerful in and of itself, it can be used to call the XMP metadata of a PDF, Though one could use it to call on Java to do the same “work”.

Three Lingering Thoughts

Even if the Resource and Metadata Packager has the abilities to embed the metadata in the files themselves, it does not mean that the submitters would know about how to use them or why to use them. This is not, however, a valid reason to not include functionality in a development project. All marketing aside, an archive does have a responsibility to consumers of the digital content, that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers do appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process, pays dividends as consumers use the files with programs like Picasa, iPhoto, PhotoShop, iTunes, Mendeley, Papers, etc.
Another thought that comes to mind is that When one is dealing with large files (over 1 GB) It occurs to me that there is a reason for making a “preview” version of a couple of MB. That is if I have a 2 GB audio file, why not make 4 MB .mp3 file for rapid assessment of the file to see if it is worth downloading the .wav file. It seems that a metadata packager could also create a presentation file on the fly too. This is no-less true with photos or images. If a command-line tool could be used like imagemagick, that would be awesome.
This problem has been addressed in the open source library science world. In fact a nice piece of software does live out there. It is called the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs but it is a solution for some needs of this type.

Matching IPTC to Dublin Core

Posted on July 20, 2011 by Hugh Paterson III

Go to start of metadata
Matching IPTC to Dublin Core:

http://metadatadeluxe.pbworks.com/w/page/25784393/W3C,-IPTC,-Dublin-Core,-and-Adobe

How this metadata stuff is stored

Posted on July 20, 2011 by Hugh Paterson III

This is an introduction to how this metadata is stored.

http://wiki.gbif.org/gbif/wikka.php?wakka=MMMetaData

XMP Sidecar Files
A sidecar file is an alternative to storing the metadata directly in the image file itself by instead storing the data in a separate .xmp file with the same base name as the photo. Sidecars are typically used in cases where the file format of the photo doesn't directly support embedding metadata or in cases when the image file should not be edited directly. It should be noted that very few programs support the reading the xmp sidecar files, most will default to reading and writing to the photo directly.

Gracefully copied from http://www.earlyinnovations.com/photolinker/xmp-sidecar-files.html

Note: that side car files are separate files from the photos so if that photo were to be archived it would need to form a package of two files a sidecar file and the main photo image.

Embedded Metadata
Exif
http://www.digicamhelp.com/glossary/exif-data/
http://www.opanda.com/en/iexif/
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html
http://www.stuffware.co.uk/cheese/

IPTC
http://www.iptc.org/std/photometadata/2008/specification/IPTC-PhotoMetadata-2008.pdf

The funky Dublin Core Metadata Schema
http://www.iptc.org/std/Iptc4xmpCore/1.0/specification/Iptc4xmpCore_1.0-spec-XMPSchema_8.pdf
http://www.prismstandard.org/about/relationships.asp

Ok, so the items which can be recorded in Exif data spots, might not be able to be recorded in ITCP spots and vise versa. That is that some elements like copyright holder or photographer can be recorded in the ITCP data but not in the Exif data. This means that there is not 100% correspondence between the two sets. We can not choose to use one and ignore the other. When we throw XMP into the mix there are attentional things which can be recorded in XMP but not in either of the other sets. Additionally, XMP is in its own file, not embedded. Dublin Core (DC) is also a set of options for metadata. They are not embedded in the photo itself, rather in a way it is a way to organize a database of metadata about objects. REAP uses DC. DC is extensible, that is we can move embedded metadata (or sidecars) from photos into the REAP database's metadata structure. But then what happens when the photo is removed from the REAP container. Does the metadata travel with it?

Here is a clip gracefully copied from http://www.earlyinnovations.com/photolinker/annotation-philosophy.html

Many popular websites and applications allow you to annotate your photos by adding keywords, a description, a title, a location, a list of the people in the photograph and many other tags. These websites and applications generally suffer from two major deficiencies:

Annotations are often exclusively added to a propriety database, and not written back to the photo. This means that unless the software or website is still available in, say, 50 years, the annotations will be completely lost.
Programs that do write the annotations directly to the image file usually corrupt existing tags or write partial information.
PhotoLinker solves both of these issues.

PhotoLinker write the annotations directly to the photo so that your annotations stay with the photo forever. After annotating with PhotoLinker you can use other programs or upload to popular websites with the knowledge that your annotations will stay with your copy of the photos.
PhotoLinker is one of first application to adhere to Metadata Working Group Guidelines for Handling Image Metadata. These guidelines ensure that annotations are handled correctly. In addition, PhotoLinker maintains transparency about how it handles the metadata by using open source tool ExifTool and showing exactly which tags are between read and written.

Photos and Metadata

Posted on July 19, 2011 by Hugh Paterson III

There are several things to think about with respect to photos and metadata.

1. The What: this is the elements of the metadata's data. The "Who", "What", "Where", "When", "Why and "How" of the photo.

2. The How: The technical storage of the metadata. Where is all of this data stored.

These two issues are discussed in their own child pages. There is a lot to say on each one.

In brief the What tries to answer the question What kind of meta-data should or can be collected?

Whereas the How tries to answer the question What should or can be be done with this meta-data?

File Naming convention: http://www.controlledvocabulary.com/imagedatabases/filenaming.html

Keywords: http://www.controlledvocabulary.com/metalogging/ck_guidelines.htm

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Category Archives: Meta-data

Citations, Names and Language Documentation

Language Documentation and the Datum

So, what am I talking about?

Letting Go

TIFFs, PDFs and OCR

So is OCR built into PDFs? or is there a need for independent OCR?

The better OCR option

Are all files used to create an orignal PDF included in the PDF?

Metadata Magic

Three Lingering Thoughts

Matching IPTC to Dublin Core

How this metadata stuff is stored

Photos and Metadata