Mode vs. Medium

Posted on January 5, 2023 by Hugh Paterson III

Two terms which I find easy to confuse are Medium and Mode.

Medium relates to format and the carrier, whereas mode is more like the classification of mediums by how they are experienced. Mode relates to the mode of communication: for example, visual, linguistic, spatial, aural, or gestural.

Mode is also not to be confused with mode of issuance, which relates to whether the resource is released as a single unit or as a multipart unit, often over time.

Digitization Services

Posted on December 31, 2011 by Hugh Paterson III

Over the last several months I have been looking for and comparing digitization services for audio, film, and images (slides and more). I have been doing this as part of the ongoing work at the Language and Culture Archive to preserve the linguistic and cultural heritage of the people groups SIL International has encountered and served. I have not come to any hard and fast conclusions on “what is the best service provider”. This is partially because we are still looking at various outsourcing options, and looking at multiple mediums is time consuming. Then there is also the issue of looking for archival standards and the creation of corporate policy for the digitization of these materials. I am presenting several names here as the results of several searches for digitization service providers.

Last month I was passed a short film on the BBC highlighting one of these providers. The short is well worth the watch because it highlights the reason and madness behind some of the work of digitization.

Professional Services

Several companies have come to the top of the list:

http://dijifi.com/ – Does the UN’s Collections
http://www.digmypics.com/ – does work for National Geographic
http://www.scancafe.com/ – Great consumer grade service

Doing it on our own

Another option the Archive has been looking at is to determine whether the quantity of the work makes it cost prohibitive to have it professionally done, meaning that we would be better served by buying the equipment and doing the work in house. So in the process I have also been looking at people’s experience with various kinds of equipment and technology used in scanning.

I have been reading a lot of user stories like Dave Dyer’s reflection on Slide Transfer and MacIntouch Reader Reports from 26 April 2006 on Slide Digitization.

Presentation version vs. Archival version of Digital Audio files

Posted on October 29, 2011 by Hugh Paterson III

What is an archival version of an audio file?

An archival version of an audio file is a file which represents the original sound faithfully. In archiving we want to keep a version of the audio which can be used to make other products and also be used directly itself if needed. This is usually done through PCM. There are several file types which are associated with PCM or RAW uncompressed faithful (to the original signal) digital audio. These are:

Standard Wave
AIFF
Wave 64
Broadcast Wave Format (BWF)

One way to understand the difference between audio file formats is to understand how the different formats are used. One place which has been helpful to me has been the DOBBIN website, as they explain their software and how it can change audio from one PCM based format to another.

Each one of these file types has the flexibility to have various kinds of components, i.e. several channels of audio can be in the same file, or one can have .wav files with different bit depths or sampling rates. But they are each an archive-friendly format. Before one says that a file is suitable for archiving simply based on its file format, one must also consider things like sample rates, bit depth, embedded metadata, channels in the file, etc. I was introduced to DOBBIN as an application resource for audio archivists by a presentation by Rob Poretti. ^[1]

One additional thing that is worth noting in terms of archival versions of digital audio pertains to born-digital materials. Sometimes audio is recorded directly to a lossy compressed audio format. It would be entirely appropriate to archive a born-digital filetype based on the content. However, it should be noted that in this case the recordings should have been done in a PCM file format.
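To make the point about looking past the file extension concrete, here is a minimal sketch that reads the sample rate, bit depth and channel count out of a PCM .wav file using Python's standard `wave` module. The function names and the policy thresholds (48 kHz, 24 bit) are assumptions made up for illustration; they are not the Archive's policy.

```python
import io
import wave


def describe_wav(source):
    """Return the technical properties of a PCM WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def meets_policy(props, min_rate=48000, min_bits=24):
    """Check properties against a HYPOTHETICAL archival policy (assumed values)."""
    return props["sample_rate_hz"] >= min_rate and props["bit_depth"] >= min_bits


def make_test_wav(rate, sample_width, channels, frames):
    """Build a silent WAV file in memory so the sketch can run without real audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * sample_width * channels * frames)
    buffer.seek(0)
    return buffer


props = describe_wav(make_test_wav(44100, 2, 2, 44100))
print(props, meets_policy(props))
```

Two files with the same .wav extension can give very different answers here, which is why the extension alone does not say whether a file is suitable as an archival version.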

What is a presentation version? (of an audio file)

A presentation version is a file created with a content use in mind. There are several general characteristics of this kind of file:

It is one that does not retain the whole PCM content.
It is usually designed for a specific application. (Use on a portable device, or personal audio player)
It can be thought of as a derivative product from an original audio or video stream.

In terms of file formats, there is not just one file format which is a presentation format. There are many formats. This is because there are many ways to use audio. For instance there are special audio file types optimized for various kinds of applications like:

3G and WiFi Audio and A/V services
Internet audio for streaming and download
Digital Radio
Digital Satellite and Cable
Portable players

A brief look at an explanation by Cube-Tec might help to get the gears moving. It is part of the inspiration for this post.

This means there is a long list of potential audio formats for the presentation form.

AAC (aac)
AC3 (ac3)
Amiga IFF/SVX8/SV16 (iff)
Apple/SGI (aiff/aifc)
Audio Visual Research (avr)
Berkeley/IRCAM/CARL (irca)
CDXA, like Video-CD (dat)
DTS (dts)
DVD-Video (ifo)
Ensoniq PARIS (paf)
FastTracker2 Extended (xi)
Flac (flac)
Matlab (mat)
Matroska (mkv/mka/mks)
Midi Sample dump Format (sds)
Monkey’s Audio (ape/mac)
Mpeg 1&2 container (mpeg/mpg/vob)
Mpeg 4 container (mp4)
Mpeg audio specific (mp2/mp3)
Mpeg video specific (mpgv/mpv/m1v/m2v)
Ogg (ogg/ogm)
Portable Voice format (pvf)
Quicktime (qt/mov)
Real (rm/rmvb/ra)
Riff (avi/wav)
Sound Designer 2 (sd2)
Sun/NeXT (au)
Windows Media (asf/wma/wmv)

Aside from just the file format difference in media files (.wav vs. .mp3) there are three other differences to be aware of:

Media stream quality variations
Media container formats
Possibilities with embedded metadata

Media stream quality variations

Within the same file type there might be a variation in the quality of audio. For instance MP3 files can have a variable rate of encoding (variable bit rate) or they can have a steady rate of encoding (constant bit rate). When they have a steady rate of encoding they can have a high or a low rate of encoding. WAV files can also have a high or a low bit depth and a high or a low sample rate. Some file types can have more channels than others. For instance AAC files can have up to 48 channels whereas MP3 files can only have up to 5.1 channels. ^[2]
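For uncompressed PCM audio the effect of these settings can be worked out directly: the data rate is the sample rate times the bit depth times the number of channels. The sketch below is only an illustration of that arithmetic; the function names are my own.

```python
def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


def megabytes_per_minute(bit_rate: int) -> float:
    """Storage needed for one minute of audio, in megabytes (10**6 bytes)."""
    return bit_rate / 8 * 60 / 1_000_000


cd_quality = pcm_bit_rate(44_100, 16, 2)  # 1,411,200 bits per second
high_res = pcm_bit_rate(96_000, 24, 2)    # 4,608,000 bits per second

print(megabytes_per_minute(cd_quality))
print(megabytes_per_minute(high_res))
```

So a stereo WAV file at 96 kHz and 24 bit takes a little over three times the space of the same recording at 44.1 kHz and 16 bit, even though both are ".wav" files.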

One argument I have heard in favor of saving disk space is to use lossless compression rather than WAV files for archive quality (and as archive version) recordings. As far as archiving is concerned, these lossless compression formats are still product oriented file formats. One thing to realize is that not every file format can hold the same kind of audio. Some formats have limits on the bit depth of the samples they can contain, or they have a limit on the number of audio channels they can have in a file. This is demonstrated in the table below, taken from Wikipedia. ^[3] This is where understanding the relationship between a file format, a file extension and a media container format is really important.

Audio compression format	Algorithm	Sample Rate	Bits per sample	Latency	Stereo	Multichannel
ALAC	Lossless	44.1 kHz to 192 kHz	16, 24	?	Yes	Yes
FLAC	Lossless	1 Hz to 655350 Hz	8, 16, 20, 24, (32)	4.3ms - 92ms (46.4ms typical)	Yes	Yes: Up to 8 channels
Monkey's Audio	Lossless	8, 11.025, 12, 16, 22.05, 24, 32, 44.1, 48 kHz	?	?	Yes	No
RealAudio Lossless	Lossless	Varies (see article)	Varies (see article)	Varies	Yes	Yes: Up to 6 channels
True Audio	Lossless	0–4 GHz	1 to > 64	?	Yes	Yes: Up to 65535 channels
WavPack Lossless	Lossless, Hybrid	1 Hz to 16.777216 MHz	varies in lossless mode; 2.2 minimum in lossy mode	?	Yes	Yes: Up to 256 channels
Windows Media Audio Lossless	Lossless	8, 11.025, 16, 22.05, 32, 44.1, 48, 88.2, 96 kHz	16, 24	>100ms	Yes	Yes: Up to 6 channels

Media container formats

Media container formats can look like file types but they really are containers of file types (think of a folder with an extension). Often they allow for the bundling of audio and video files with metadata and then enable this set of data to act like a single file. On Wikipedia there is a really nice comparison of container formats.

MP4 is one such container format. Apple Lossless data is stored within an MP4 container with the filename extension .m4a. This extension is also used by Apple for AAC audio data in an MP4 container (same container, different audio encoding). However, Apple Lossless is not a variant of AAC (which is a lossy format), but rather a distinct lossless format that uses linear prediction similar to other lossless codecs such as FLAC and Shorten. ^[4] Files with a .m4a extension generally do not have a video stream, even though MP4 containers can also have a video stream.

MP4 can contain:

Video: MPEG-4 Part 10 (H.264) and MPEG-4 Part 2
Other compression formats are less used: MPEG-2 and MPEG-1
Audio: Advanced Audio Coding (AAC)
Also MPEG-4 Part 3 audio objects, such as Audio Lossless Coding (ALS), Scalable Lossless Coding (SLS), MP3, MPEG-1 Audio Layer II (MP2), MPEG-1 Audio Layer I (MP1), CELP, HVXC (speech), TwinVQ, Text To Speech Interface (TTSI) and Structured Audio Orchestra Language (SAOL)
Other compression formats are less used: Apple Lossless
Subtitles: MPEG-4 Timed Text (also known as 3GPP Timed Text).
Nero Digital uses DVD Video subtitles in MP4 files. ^[5]

This means that MP3 audio can be contained inside of an .mp4 file. This also means that audio files are not always what they seem to be on the surface. This is why I advocate that an archive of digital files, particularly one which archives for a digital publishing house, also use technical metadata as discovery metadata. File type alone is not enough to know about a file.
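One small way to look beneath the surface is to read the first few bytes of a file rather than trusting its extension. The following is a rough sketch, not a complete format identifier: it only recognises a handful of well-known signatures, and the function name and return labels are my own invention.

```python
def sniff_audio_container(header: bytes) -> str:
    """Guess the container from the first 12 bytes of a file (illustrative only)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "RIFF/WAVE"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "AIFF"
    if header[:4] == b"fLaC":
        return "FLAC"
    if header[:4] == b"OggS":
        return "Ogg"
    if header[4:8] == b"ftyp":
        # MP4-family files carry a four-character "major brand" after 'ftyp'.
        brand = header[8:12].decode("ascii", errors="replace")
        return f"MP4 family (brand {brand!r})"
    if header[:3] == b"ID3":
        return "ID3v2 tag (commonly MP3)"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file on disk to apply the check above."""
    with open(path, "rb") as handle:
        return sniff_audio_container(handle.read(12))


print(sniff_audio_container(b"\x00\x00\x00\x20ftypM4A "))
```

Note that even this only identifies the container. An .m4a file identified as "MP4 family" could still hold either AAC or Apple Lossless audio, so finding the codec means reading further into the file, which is exactly the kind of technical metadata worth recording.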

Possibilities with embedded metadata

Audio files also vary greatly in what kinds of embedded metadata and metadata formats they support. MPEG-7, BWF and MP4 all support embedded metadata. But this does not mean that audio players in the consumer market or prosumer market respect this embedded metadata. ARSC has an interesting report on the support for embedded metadata in audio recording software. ^[6] Aside from this disregard for embedded metadata, there are various metadata formats which are embedded in different file types. One common type, ID3, is popular with .mp3 files. But even ID3 comes in different versions.
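As a small illustration of the ID3 versions issue: an ID3v2 tag sits at the start of a file and begins with the bytes "ID3" followed by a major version and a revision byte, while an ID3v1 tag is the last 128 bytes of a file and begins with "TAG". A sketch that reports which are present (the function name is hypothetical, and it does not read any of the tag contents):

```python
def id3_versions(data: bytes) -> list[str]:
    """Report which ID3 tag versions appear in the raw bytes of a file."""
    versions = []
    # ID3v2: 10-byte header at the start; bytes 3 and 4 are major version and revision.
    if len(data) >= 10 and data[:3] == b"ID3":
        versions.append(f"ID3v2.{data[3]}.{data[4]}")
    # ID3v1: fixed 128-byte block at the very end, starting with 'TAG'.
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        versions.append("ID3v1")
    return versions


sample = b"ID3\x03\x00\x00" + b"\x00" * 204 + b"TAG" + b"\x00" * 125
print(id3_versions(sample))
```

A single .mp3 file can carry both kinds of tag at once, and the two do not have to agree with each other.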

In archiving Language and Culture Materials our complete package often includes audio but rarely is just audio. However, understanding the audio components of the complete package helps us understand what it needs to look like in the archive. In my experience in working with the Language and Culture Archive, most contributors are not aware of the difference between Archival and Presentation versions of audio formats, and those who think they are generally are not aware of the differences in codecs used (sometimes with the same file extension). From the archive’s perspective this is a continual point of user/submitter education. This past week I have taken the time to listen to a few presentations by audio archivists from the 2011 ARSC convention. These in general show that the kinds of issues that I have been dealing with in the Language and Culture Archive are not unique to our context.

Anthony Seeger, Maureen Russell, David Martinelli. Ethnographic Sound Archives. http://www.arsc-audio.org/conference/audio2011/mp3/14.mp3 [Accessed 24 Oct. 2011]
Wendy Sistrunk, Sandy Rodriguez. The Goldin Transcription Collection at UMKC. http://www.arsc-audio.org/conference/audio2011/mp3/16.mp3 [Accessed 24 Oct. 2011] [PDF visual of presentation]
Birgitta Johnson. Gospel music in L.A. http://www.arsc-audio.org/conference/audio2011/mp3/39.mp3 [Accessed 24 Oct. 2011]

The Complete Audio Package

References
[1] Rob Poretti. 2011. Audio Analysis and Processing in Multi-Media File Formats. ARSC 2011. http://www.arsc-audio.org/conference/audio2011/extra/48-Poretti.pptx [Accessed: 24 October 2011]
[2] Various Contributors. 21 October 2011 at 21:44. Wikipedia: Advanced Audio Coding, AAC’s improvements over MP3. http://en.wikipedia.org/wiki/Advanced_Audio_Coding#AAC.27s_improvements_over_MP3
[3] Various Contributors. 21 October 2011 at 10:26. Wikipedia: Comparison of audio formats, Technical Details of Lossless Audio Compression Formats. http://en.wikipedia.org/wiki/Comparison_of_audio_codecs#Technical_Details_of_Lossless_Audio_Compression_Formats
[4] Various Contributors. 6 October 2011 at 03:11. Wikipedia: Apple Lossless. http://en.wikipedia.org/wiki/Apple_Lossless
[5] Various Contributors. 11 October 2011 at 15:00. Wikipedia: MPEG-4 Part 14. http://en.wikipedia.org/wiki/.m4a
[6] Chris Lacinak, Walter Forsberg. 2011. A Study of Embedded Metadata Support in Audio Recording Software: Summary of Findings and Conclusion. ARSC Technical Committee. http://www.arsc-audio.org/pdf/ARSC_TC_MD_Study.pdf

Citations, Names and Language Documentation

Posted on September 30, 2011 by Hugh Paterson III

I have recently been reading the blog of Martin Fenner and came upon the article Personal names around the world. ^[1] His post is in fact a reflection on a W3C paper of the same title, Personal Names around the World. (Several other reflections are here: http://www.w3.org/International/wiki/Personal_names.) This is apparently coming out of the i18n effort and is an effort to help authors and database designers make informed decisions about names on the web.
I read Martin’s post with some interest because in Language Documentation getting someone’s name as a source or for informed consent is very important (from a U.S. context). Working in an archive dealing with language materials, I see a lot of names. One of the interesting situations which came to me from an Ecuadorian context was different from what I have seen in the w3.org paper or in the w3.org discussion. The naming convention went like this:

The elder was known by the younger’s name plus a relationship.

My suspicion is that it is a taboo to name the dead. So to avoid possibly naming the dead, the younger was referenced and the relationship was invoked. This affected me in the archive as I am supposed to note who the speaker is on the recordings. In lieu of the speaker’s name, I have the first name of the young son, who is well known in the community and is in his 30s or so, and I have the relationship. So in English this might sound like John’s mother. Now what am I supposed to put in the metadata record for the audio recordings I am cataloging? I do not have a name but I do have a relationship to a known (to the community) person.
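I do not have a settled answer, but one possible shape for such a record is to let the name be optional and to allow a contributor to be identified by a relationship to another known person instead. The sketch below is purely hypothetical; the field names are invented and do not come from any metadata standard or from the Archive's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contributor:
    """A HYPOTHETICAL contributor record where a personal name is optional."""
    role: str                            # e.g. "speaker"
    name: Optional[str] = None           # may be unknown or not appropriate to record
    related_to: Optional[str] = None     # a person known to the community
    relationship: Optional[str] = None   # how the contributor relates to that person

    def label(self) -> str:
        """Return the most specific identification available."""
        if self.name:
            return self.name
        if self.related_to and self.relationship:
            return f"{self.relationship} of {self.related_to}"
        return "unidentified"


speaker = Contributor(role="speaker", related_to="John", relationship="mother")
print(speaker.label())
```

The design choice here is simply that the record keeps what is actually known (a relationship and a reference person) rather than forcing something into a name field.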

I inquired with a literacy consultant who has worked in Ecuador with indigenous people for some years. She informed me that in one context she was working in, everyone knew what family line they were from and all the names were derived from that family line by position. It was such that to call someone by their name was an insult.

It sort of reminds me of this sketch by Fry and Laurie.

References
[1] Martin Fenner. 14 August 2011. Personal names around the world. PLoS Blog Network. http://blogs.plos.org/mfenner/2011/08/14/personal-names-around-the-world [Accessed: 16 September 2011]

Smart Lists and UI

Posted on September 27, 2011 by Hugh Paterson III

Working in an archive, I deal with a lot of metadata. Some of this metadata is from controlled vocabularies. Sometimes they show up in lists. Sometimes these controlled vocabularies can be very large, like for the names of languages, where there is a limited number of languages but the number is just over 7,000. I like to keep an eye out for how websites optimize the options for users. Facebook has a pretty cool feature for narrowing down the list of possible family relationships someone has to you, i.e. a sibling could be a brother/sister, step-brother/step-sister, or a half-brother/half-sister. But if the sibling is male, it can only be a brother, step-brother, or a half-brother.

Facebook narrows the logical selection down based on attributes of the person mentioned in the relationship.

All the relationship options.

That is, if I select Becky, my wife, as a person to be in a relationship with me, then Facebook determines, based on her gender attribute, that she can only be referenced by the female relationships.

Menu showing just some relationships based on an attribute of the person referenced.
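The narrowing behaviour described above can be sketched in a few lines. This is my own guess at the logic, not how Facebook actually implements it; the list of relationships and the attribute values are assumptions for illustration.

```python
# Each option in the controlled vocabulary is paired with the attribute it requires.
RELATIONSHIPS = [
    ("brother", "male"), ("sister", "female"),
    ("step-brother", "male"), ("step-sister", "female"),
    ("half-brother", "male"), ("half-sister", "female"),
]


def relationship_options(gender=None):
    """Return every option when the attribute is unknown, else only the matching ones."""
    return [label for label, required in RELATIONSHIPS
            if gender is None or required == gender]


print(relationship_options())        # the full list
print(relationship_options("male"))  # the narrowed list
```

The same idea could apply to a large controlled vocabulary like language names: if another attribute of the record is already known (a country, say), the list offered to the user can be narrowed before they ever see it.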

The iPad Team

Posted on September 16, 2011 by Hugh Paterson III

The Concept

I have had some ideas I wanted to try out for using the iPad as a tool for collecting photo metadata. Working in a corporate archive, I have become aware of several collections of photos without much metadata associated with them.

The photos are the property of (or are in the custodial care of) the company I work at (in their corporate archive).

The subjects of the photos fall into two general areas:

The minority language speaking people groups that the employees of a particular company worked with, including anthropological topics like ways of life, etc.
Photos of operational events significant to telling the story of the company holding the photos.

Archives in more modern contexts are trying to show their relevance to not only academics, but also to general members of communities. In the United States there is a whole movement of social history. There are community preservation societies which take on the task of collecting old photographs and their stories and preserving, and presenting them for future generations.

The challenge at hand is: "How do we enrich photos by adding metadata to photos in the collections of archives?" There are many solutions to this kind of task. The refining, distilling, and reduction of stories and memories to writing and even to metadata fields is no easy task, nor is it a task that one person can do on their own. One solution, which is often employed by community historians, is the personal interview. Interviewing the photographers or people who were at an event, and asking them questions about a series of photos, creates an atmosphere of inquisitiveness, one where the story-teller is valued because they have a story-listener. This basic personal connection allows for interactions to occur across generational and technological barriers.

The crucial question is: "How do we facilitate an interaction which is positive for all the parties involved?" The effort and thinking behind answering this question has more to do with shaping human interactions than with anything else. We are also talking about using technology in this interaction. This is true UX (User Experience).

Past Experience

This past summer I have had several experiences with facilitating one-on-one interactions between knowledgeable parties working with photographs and with someone acting on behalf of the corporate archive. To facilitate this interaction a GoogleDoc Spreadsheet was set up and the person acting on the behalf of the archive was granted access to the spreadsheet. The individual conducting the interview and listening to the stories brought their own netbook (small laptop) from which to enter any collected data. They were also given a photo album full of photos, which the interviewee would look through. This set-up required overcoming several local environmental challenges. As discussed below, some of these challenges were better addressed than others.

Association of Data to a Given Photo

The challenge was keeping up to 150 photos organized during an interview so that metadata about any given photo could be collected and associated with only that photo. This was addressed by adhering an inventory sticker to the back of each photo and assigning each photo a single row in the GoogleDoc Spreadsheet. Using GoogleDocs was not the ideal solution, but rather a solution of some compromises:

Strengths of GoogleDocs

One of the great things about GoogleDocs is that the capability exists for multiple people to edit the spreadsheet simultaneously.
Another strength of GoogleDocs is that there is a side bar chat feature, so that if there is a question during the interview, help can be had very quickly from management (me, who was offsite).
The Data can be exported in the following formats: .xlsx , .xls , .csv , .pdf.
There was no cost to deploy the technology.
It is accessible through a web-browser in an OS neutral manner.
The document is available wherever the internet is available.
A single solution could be deployed and used by people digitizing photos, recording written metadata on the photos, and gathering metadata during an interview.
Most people acting on behalf of the archive were familiar with the technology.

Pitfalls of GoogleDocs

More columns exist in the spreadsheet than can be practically managed (The columns are presented below in a table). There are about 48 values in a record and there are about 40,000 records.

More columns than can be practically managed

Does not display the various levels of data as levels in the user interface.
Cannot remove unnecessary fields from the UI of various people. (No role-based support.)
Only available when there is internet.

Maximizing of Interview Time

To maximize time spent with the interviewee, the photos and any metadata written or known about a photo were put into the GoogleDoc Spreadsheet prior to the interview. Sometimes this was not done by the interviewer but rather by someone else working on behalf of the archive. During the interview the interviewer could tell which data fields were empty by looking for the gray cells in the spreadsheet. However, just because the cells were gray did not mean that the interviewee was more prone to provide the desired, unknown, information.

Grey Areas Show Metadata fields which are empty
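The same "which fields are still empty?" check can also be run outside the spreadsheet, on a .csv export. A minimal sketch follows; the column names and the sample rows are invented for the example and are not the real columns of our spreadsheet.

```python
import csv
import io


def empty_fields(csv_text, id_column="photo_id"):
    """Map each photo ID to the list of columns that are still blank for it."""
    missing = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        blanks = [column for column, value in row.items()
                  if column != id_column and not (value or "").strip()]
        if blanks:
            missing[row[id_column]] = blanks
    return missing


# Invented sample data standing in for a spreadsheet export.
sample = (
    "photo_id,people,place,year\n"
    "P001,Jane Doe,,1972\n"
    "P002,,,\n"
    "P003,John Roe,Quito,1980\n"
)
print(empty_fields(sample))
```

A list like this could be used before an interview to decide which photos to bring, since photos with no blanks do not need to be on the table at all.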

Data Input Challenges

One unanticipated challenge which was encountered in the interviews was that when the interviewer brought out an album or two of photos, the interviewees were able to cover more photos than the interviewer could record.

Let me spell it out. There is one interviewer and two interviewees, and there are 150 photos in an album lying open on the table. All three participants are looking at the photo album. Interviewee A says, "Look, that is so-and-so," and then interviewee B (because the other page is closer to them) says, "And this is so-and-so!" This happens for about 8 of the 12 facing photos. Because the interviewer is still typing the first name mentioned, they ask, "And when do you think that was?" But the metadata still comes in faster, as the second interviewee did not hear the question and the first one did but is still thinking. The bottom line is that more photos are viewed and commented on faster than can be recorded.

Something that could help this process would be to in some way slow down (or moderate) the ability of the interviewee(s) to access the photos: something that could synchronize the processing times with the viewing times. Scanning the photos and then displaying them on a tablet slows down the viewing process and integrates the recording of data with the viewing of photos.

Positional Interaction Challenges

An interview is, at some level, an interaction. One question which comes up is: how does the technology used affect that interaction? What we found was that a laptop usually was situated between the interviewer and the interviewees. This positioned the parties in an opposing manner. Rather than the content becoming the central focus of both parties, the content was either in front of the interviewer or in front of the interviewees. A tablet changes this dynamic in the interaction. It brings both parties together over a single set of content, both positionally and cognitively. When the photo is displayed on the laptop, the laptop has to be rotated so that the interviewees can see the image and then turned back so that the interviewer can input the data. This is not the case for a tablet.

Content Management Challenges

When paper is used for collecting metadata it is ideal to have one piece of paper for each photo. Sometimes this method is preferable to using a single computer. I used this method when I had a photo display and about 20 albums and about 200 people all filling out details at once.

People filling out metadata forms in front of a photo display.

People came and went as they pleased. When someone recognized someone or someplace they knew, they wrote down the picture ID and the info they were contributing along with their name. However, when carrying around photo albums and paper there is the challenge of keeping all the photos from getting damaged, and of maintaining the order of the photos and associated papers.

Connectivity Challenges

When there is no internet there is no access to GoogleDocs. We encountered this when we went to someone's apartment, expecting internet because the internet is available on campus and this apartment was also on campus. Fortunately we did have a backup plan, and paper and pen were used. But this means that we then had to type out the data which was written down on the paper, in effect doing the same recording work twice.

Size of Devices

Photo albums have a certain bulk and cumbersome-ness which is multiplied when carrying more than one album at a time. Add to this a laptop computer and one might as well add a hand truck to the list of required items, with which to carry everything. A tablet is all in all a lot smaller and lighter.


Laptop and Tablet. This image is credited to Alia Haley. ^[2] Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. http://churchm.ag/tablet-vs-laptop [Accessed: 11 September 2011]

Proof of Concept Technology

As I mentioned before, I had an iPad in my possession for a few days. So to capitalize on the opportunity, I bought a few apps from the app store, as I mentioned that I would, and tried them out.

Software which does not work for our purposes

Photoforge2
The first app I tried was Photoforge2. It is a highly rated app in the app store. I found that it delivered as promised. One could add or edit the IPTC and EXIF metadata. One could even edit where the photo was taken with a pin drop interface.

iPad Fotoforge Location Data

iPad Fotoforge Metadata Editor

Meta Editor
Meta Editor, another iPad app which was also highly acclaimed, performed the task almost as well. Photoforge2 had some photo editing features which were not needed in our project, whereas Meta Editor was focused only on metadata elements.

MetadataEditor Location Data

After using both applications it became apparent that neither would work for this project for at least two reasons:

Both applications edit the standards-based IPTC and EXIF metadata fields in photos. We have some custom metadata which does not fit into either of these fields.
Even if we do embed the metadata with these apps, the version of the photo being enriched is not the Archival version of the photo; it is the Presentation version. We still need the data to become associated with the archival version of the photo.

One aspect of the technology being discussed, which might be helpful for readers to understand, is that these iPad applications actually embed the metadata into the photos. So when the photos are then taken off of the iPad the metadata travels with them. This is a desirable feature for presentation photos.

Software with some really functional features

So we needed something with a mechanism for capturing our customized data. Two options were found which seemed suitable for the task. One was ideal; the other was rapidly deployable. Understanding the iPad's place in the larger picture of corporate architecture, its relationship to the digital repository, and the process of data flow from the point of collection to dissemination will help us to visualize the particular challenges that the iPad presents solutions for. Once we see where the iPad sits in relationship to the rest of the digital landscape, I think it will be fairly obvious why one solution is ideal and the other rapidly deployable.

Placement of the iPad in the Information Architecture Picture

In my previous post on Social Metadata Collection ^[3]Hugh J. Paterson III. 29 June 2011. The Journeyler. [Accessed: 13 September 2011] https://hugh.thejourneyler.org/social-meta-data-collection. [Link] I used the below image to show where the iPad was used in the metadata collection process.

Meta-data Collection Model

Since that time, as I have shown this image when I talk about this idea, I have become aware that the image is not detailed enough. Because it is not detailed enough it can lead to some wrong assumptions on how the iPad use being proposed actually works. So, I am presenting a new image with a greater level of detail to show how the iPad interacts with other corporate systems and workflows.

iPad Team as they fit with other digital elements.

There are several things to note here:

Member Diaspora as represented here is not just members; it is their families, the people with whom these members worked, the members currently working, and the members living close at hand on campus, not just in diaspora.
It is a copy of the presentation file which is pushed out to the iPad or the website for the Member Diaspora. This copy of the file does not necessarily need to be brought back to the archive as long as the metadata is synced back appropriately.
The Institutional Repository for other corporate items is currently in a DSpace instance. However, it has not been decided for sure that photos will be housed in this same instance, or even in DSpace.

That said, it is important that the metadata be embedded in the presentation file of the image, as well as accessible to the Main container for the archival of the photos. The metadata also needs to sync between the iPad application and the Member Diaspora website. Metadata truly needs to flow through the entire system.

FileMaker Pro with FileMaker Go

FileMaker Pro is a powerful database app. It could drive the Member Diaspora website and then also sync with the iPad. This would be a one-stop solution and therefore an ideal solution. It is also complex and takes more skill to set up than I currently have, or can currently spare the time to acquire. Both FileMaker Pro and its younger cousin Bento enable photos to be embedded in the actual database. Several tips from the Bento forums on syncing photos which are part of the database:
Syncing pictures from Bento-Mac to Bento-iPad
Sync multiple photos or files from desktop to iPad
This is something which is important with regards to syncing with the iPad. To the best of my knowledge (and googling) no other database apps for the iPad or Android platforms allow for the syncing of photos within the app.

Bento
Bento is the rapidly deployable option. (See: What are the differences between Bento 4 for Mac, Bento for iPad 1.1.x, and Bento for iPhone/iPod touch 1.1.x?)
It took me about 2 hours (while doing other stuff) to download a trial version, find out how it worked, import my data from the GoogleDoc and then sync my database with the iPad.

Here is a YouTube video demonstrating my proof of concept using Bento.

https://youtu.be/_Eo5Ru0BF-k

Here is a series of iPad Screen shots.

Screen Long ways

Inputting Data

Data and Photo seen together.

Some outstanding issues

Geo-location of Photos in Bento. Bento version 4 does have location fields which can be used with a pin drop interface to add location data to the appropriate fields in the database. My proof of concept demo does not demonstrate this feature. Using Geo-location fields in Bento: Working with Location Fields in Bento
How to use Location fields in Bento for iPhone/iPad 1.1.1
Rapid reuse of data. Because the interview process naturally lends itself to eliciting the same kind of data over a multitude of photos, a UX/UI element which allows the rapid reuse of data would be very practical. The kinds of data which would lend themselves to rapid reuse would be people's names, locations, dates, photographer, etc. This may mean being able to query a table of already entered data values with an auto-suggest type function.
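
To make the auto-suggest idea a little more concrete, here is a minimal sketch in Python. It is not something Bento offers; the class name `ValueSuggester` and its methods are hypothetical, and I am assuming a simple case-insensitive prefix match over values that have already been entered for one field.

```python
from bisect import bisect_left


class ValueSuggester:
    """Remembers values already entered for a field and suggests them by prefix.

    Hypothetical helper for illustration; not part of Bento or FileMaker.
    """

    def __init__(self):
        self._keys = []      # sorted, lower-cased values used for prefix search
        self._display = {}   # lower-cased value -> value as it was first typed

    def remember(self, value):
        """Store a value once, ignoring case and surrounding whitespace."""
        key = value.strip().lower()
        if key and key not in self._display:
            self._display[key] = value.strip()
            self._keys.insert(bisect_left(self._keys, key), key)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` remembered values that start with `prefix`."""
        p = prefix.strip().lower()
        start = bisect_left(self._keys, p)
        hits = []
        for key in self._keys[start:]:
            if not key.startswith(p) or len(hits) == limit:
                break
            hits.append(self._display[key])
        return hits
```

One suggester per field (photographer, country, people group) would keep the suggestions relevant to what is being typed.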

Custom iPad App

Of course there is also the option to develop a custom iPad app for just our purposes. This entails some other kinds of planning, including but not limited to:

Custom App development
Support plan
Deploy or develop possible Web-backend - if needed.

Kinds of custom metadata being collected.

The table in this section shows the kinds of questions we are asking in our interviews. It is only provided for reference, as a discussion of the Information Architecture for the storage and elements of the metadata schema is out of the scope of this post. The list of questions and values presented in the table was derived as a minimal set of questions based on issues of Image Workflow Processing, Intellectual Property and Permissions, Academic Merit and input from Controlled Vocabulary's Caption and Keywording Guidelines ^[4], which is part of their series on metalogging. The table also shows corresponding IPTC and EXIF data fields. (Though they are currently empty because I have not filled them in.) Understanding the relationships of XMP, IPTC, and EXIF also helps us to understand why and how the iPad tool needs to interact with other archiving solutions. However, it is not within the scope of this post to discuss these differences. Some useful resources on these issues are noted here:

Photolinker Metadata Tags ^[5] has a nice display outlining where XMP, IPTC and EXIF data overlap. This is not authoritative, but rather practical.
List of IPTC fields. However, a list is not enough; we also need to know what the fields mean so that we know that we are using them correctly.
EXIF and IPTC Header Comments. Here is another list of IPTC fields. This list also includes a list of EXIF fields. (Again without definitions.)
Various programs and applications also add their own metadata fields in the IPTC section. Here is a mapping of some of the most popular ones: http://www.controlledvocabulary.com/imagedatabases/iptc_core_mapped.pdf
IPTC Standard Photo Metadata ^[6]: http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf
Dublin Core with Photographs: http://makeit.digitalnz.org/askaquestion/questions/26
Dublin Core Metadata Element Set, Version 1.1: http://dublincore.org/documents/dces/
DCMI Type Vocabulary: http://dublincore.org/documents/dcmi-type-vocabulary/
Describing Digital Content: http://makeit.digitalnz.org/guidelines/describing-digital-content/

It is sufficient to note that there is some, and only some, overlap.

Metadata Element	Purpose	Explanation
Photo Collection	This is the name of the collection in which the photos reside
Sub Collection	This is the name of the sub collection in which the photos reside
Letter of Collection	Each collection is given an alpha character or a series of alpha characters; if the collection pertains to one people group then the alpha characters given to that collection are the three-letter ISO 639-3 code
Who input the Meta-data	This is the name of the person inputting the metadata
Photo Number	This is the number of the photo as we have inventoried the photo
Negative Number	This is the number of the photo as it appears on the negative (film strip)
Roll	This is the ID of the Roll	Most sets of negatives are cut into strips of 5 or fewer; this allows us to group these sets together to ID a “set” of photos
Section Number	If the items are in a book or a scrap book and that scrap book has a section this is where that is recorded
Page#	If a scrap book has a set of pages then this is where they are recorded
Duplicates	This is where the Photo ID of a duplicate item is referenced.
Old Inventory Number(s)	This is the inventory number of an item if it were part of another inventory system
Photographer	This is the name of the photographer
Subject 1 (who)	Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 2		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 3		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 4		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
Subject 5		Who is in the photo; this should be an unlimited field. That is, several names should be able to be added to this.
People group	This is the name of the people group mentioned in the ISO 639-3 codes
ISO 639-3 Code	This is the ISO 639-3 code of the people group being photographed
When was the photo Taken?	The date the photo was taken
Country	The country in which the photo was taken
District/City	This is the City where the photo was taken
Exact Place	The exact place name where the photo was taken
What is in the Photo (what)	This is an item in the photo
What is in the Photo	Additional what is in the photo
What is in the Photo	Additional what is in the photo
Why was the Photo Taken?		This is to help metadata providers think about how events get communicated
Description	This is a description of the photo’s contents	This is not a caption but could be used as a caption
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
Who Provided This Meta-Data? And when?		We need to keep track of who is the source of certain metadata to understand its authority
I am in this photo and I approve it to be on the internet. Put in "yes" or "No" and write your name in the next column.	Permission to distribute
Name:	Name of the person releasing the photo
How was this photo digitized?	Method of digitization and the tools used in digitization
Who digitized This photo	This is the name of the person who did the digitization
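
The repeated “Who Provided This Meta-Data? And when?” rows suggest that each answer should carry its own source. A minimal sketch of one way to model that, assuming hypothetical names (`Assertion`, `PhotoRecord`) and made-up sample values; it is not the schema we actually use:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Assertion:
    """One answer to one question, together with who gave it and when."""
    value: str
    provided_by: str
    provided_on: date


@dataclass
class PhotoRecord:
    """Hypothetical container: element name -> list of Assertion objects."""
    photo_number: str
    assertions: dict = field(default_factory=dict)

    def add(self, element, value, provided_by, provided_on):
        """Record a value for an element without overwriting earlier answers."""
        self.assertions.setdefault(element, []).append(
            Assertion(value, provided_by, provided_on)
        )

    def sources(self, element):
        """List everyone who has supplied a value for this element."""
        return [a.provided_by for a in self.assertions.get(element, [])]
```

Keeping every answer, rather than only the latest one, lets a curator weigh conflicting answers by the authority of their source.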

References
↑1	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]
↑2	Alia Haley. 31 August 2011. Tablet vs. Laptop. Church Mag. [Accessed: 11 September 2011] http://churchm.ag/tablet-vs-laptop. [Link]
↑3	Hugh J. Paterson III. 29 June 2011. The Journeyler. [Accessed: 13 September 2011] https://hugh.thejourneyler.org/social-meta-data-collection. [Link]
↑4	Controlled Vocabulary. Caption and Keywording Guidelines. [Accessed: 13 September 2011] http://www.controlledvocabulary.com/metalogging/ck_guidelines.html. [Link]
↑5	Early Innovations, LLC. 2011. Photolinker Metadata Tags. [Accessed: 13 September 2011] http://www.earlyinnovations.com/photolinker/metadata-tags.html. [Link]
↑6	David Riecks. 2010. IPTC Standard Photo Metadata (July 2010). International Press Telecommunications Council. [Accessed: 13 September 2011] http://www.iptc.org/std/photometadata/documentation/IPTC-PLUS-Metadata-Panel-UserGuide_6.pdf [Link]

Letting Go

Posted on August 29, 2011 by Hugh Paterson III

Working in an archive, one can imagine that letting go of materials is a real challenge. It is hard to do because of policy, but also because of the emotional “pack-rat” nature of archivists. This is no less the case in the archive where I work. We were recently working through a set of items and getting rid of the duplicates. (Physical space has its price, and the work should soon be available via JSTOR.) However, one of the items we were getting rid of was a journal issue on a people group/language. The journal issue has three articles; of these, only one was written by someone who worked for the same organization I am working for now. So the “employer” and owner-operator of the archive only has rights to one of the three works. (Rights by virtue of “work-for-hire” laws.) We have the off-print, which is what we have rights to share, so we keep and share that. It all makes sense. However, what we keep is catalogued and inventoried. Our catalogue is shared with the world via OLAC. With this tool someone can search for a resource on a language, by language. It occurs to me that the other two articles on this people group/language will not show in the aggregation of results of OLAC. This is a shame, as it would be really helpful in many ways. I wish there was a groundswell, open source, grassroots, web-facilitated effort where various researchers could go and put metadata (citations) of articles, which would then be added to the OLAC search.

TIFFs, PDFs and OCR

Posted on August 16, 2011 by Hugh Paterson III

In the course of my experience I have been asked about PDFs and OCR several times. The questions usually follow the main two questions of this post.

So is OCR built into PDFs? or is there a need for independent OCR?

In particular an image based PDF, is it searchable?

The short answer is yes. Adobe Acrobat Pro has an OCR function built in. And to the second part: no, an image is not searchable. But Adobe Acrobat Pro can perform OCR on an image such as a .tiff file and then add a layer of text (the output of the OCR process) behind the image. Then when the PDF is searched, the search actually runs against the text layer behind the image. The OCR process is usually between 80-90% accurate on texts in English. This is usually good enough for finding words or partial words.

The Data Conversion Laboratory has a really nice and detailed write up on the process of converting from images to text with Adobe Acrobat Pro.

Daily Designer has a tutorial on how to do it on OS X.
David R. Mankin explains on his blog what the process looks like using Windows.

One of the beauties of Adobe Acrobat Pro is that this process can be scripted and the TIFFs processed in batches.
[On Windows] :: [On OS X using AppleScript] :: [Cross platform help from Adobe]
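
Whatever tool does the OCR, the bookkeeping of a batch run looks much the same. The following is a minimal Python sketch of that bookkeeping only; it does not use Acrobat's scripting interface. The function names are hypothetical, and `ocr_to_pdf` stands in for whichever OCR call or script is actually used.

```python
from pathlib import Path


def plan_batch(tiff_dir, pdf_dir):
    """Pair every TIFF in tiff_dir with the PDF it should become.

    TIFFs whose PDF already exists in pdf_dir are skipped, so an
    interrupted batch can simply be started again.
    """
    jobs = []
    for tiff in sorted(Path(tiff_dir).iterdir()):
        if tiff.suffix.lower() not in (".tif", ".tiff"):
            continue
        target = Path(pdf_dir) / (tiff.stem + ".pdf")
        if not target.exists():
            jobs.append((tiff, target))
    return jobs


def run_batch(jobs, ocr_to_pdf):
    """Call ocr_to_pdf(tiff, pdf) for each job; collect failures instead of stopping."""
    failures = []
    for tiff, target in jobs:
        try:
            ocr_to_pdf(tiff, target)
        except Exception as err:  # one bad scan should not end the whole batch
            failures.append((tiff, str(err)))
    return failures
```

Collecting failures rather than stopping at the first one matters when the batch is thousands of scans long.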

The University of Illinois at Chicago explains how to use Adobe Acrobat Pro and OCR with a scanner using a TWAIN driver.

The better OCR option

Since I work in an industry where we are dealing with multiple languages and the need to professionally OCR thousands of documents I thought I would provide a few links on the comparison of OCR software on the market.
Lifehacker has a short write up of the top five OCR tools.

Of those top five, two, ABBYY FineReader and Adobe Acrobat, are compared side by side in this article on both OS X and Windows.

Are all files used to create an original PDF included in the PDF?

One thing to remember, which I have said before, is that not all PDFs are created equal. This manual talks a bit about different settings inside of PDFs when using Adobe’s PDF printer.

The Short answer is No. But the long answer is Yes. Depending on the settings of the PDF creator the original files might be altered before they are wrapped in a PDF wrapper.

So the objection, usually in the form of a question, sometimes comes up:

Is the PDF file just using the PDF framework as a wrapper around the original content? Therefore, to archive things “properly” do I still need to keep the .tiff images if they are included in the PDF document?

The answer is: “it depends”. It depends on several things, one of which is, what program created the PDF and how it created the PDF. – Did it send the document through PostScript first? Another thing that it depends on is what else might one want to do with the .tiff files?

In an archiving mentality, the real question is: “Should the .tiff files also be saved?” The best practice answer is yes. The reason is that the PDF is viewed as a presentation version and the .tiff files are viewed as the digital “originals”.

Social Meta-data collection

Posted on June 29, 2011 by Hugh Paterson III

As part of my job I work with materials created by the company I work for, that is the archived materials. We have several collections of photos by people from around the world. In fact we might have as many as 40,000 photos, slides, and Negatives. Unfortunately most of these images have no Meta-data associated with them. It just happens to be the case that many of the retirees from our company still live around or volunteer in the offices. Much of the meta-data for these images lives in the minds of these retirees. Each image tells a story. As an archivist I want to be able to tell that story to many people. I do not know what that story is. I need to be able to sit down and listen to that story and make the notes on each photo. This is time consuming. More time consuming than I have.

Here is the Data I need to minimally collect:

Photo ID Number: ______________________________
Who (photographer): ____________________________
Who (subject): ________________________________
People group:_________________________________
When (was the photo taken): _______________________
Where (Country): _______________________________
Where (City): _________________________________
Where (Place): ________________________________
What is in the Photo: ____________________________
Why was the photo taken (At what event):_________________________
Photo Description:__short story or caption___
Who (provided the Meta-data): _________________________
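
As a minimal sketch of how an app could hold these answers and know what is still unanswered, here is the form above as a Python dataclass. The field names are my own shorthand for the questions, not names from any existing app.

```python
from dataclasses import dataclass, fields


@dataclass
class MinimalPhotoRecord:
    """One field per question on the form above; empty string means unanswered."""
    photo_id: str = ""
    photographer: str = ""
    subject: str = ""
    people_group: str = ""
    when_taken: str = ""
    country: str = ""
    city: str = ""
    place: str = ""
    what: str = ""
    why: str = ""
    description: str = ""
    metadata_provider: str = ""

    def missing(self):
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A volunteer's screen could then show only the questions returned by `missing()` for the photo on display.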

Here is my idea: Have 2 volunteers with iPads sit down with the retirees and show these pictures on the iPads to the retirees and then start collecting the data. The iPad app needs to be able to display the photos and then be able to allow the user to answer the above questions quickly and easily.

One app which has a really nice UI for editing photos is PhotoForge. [Review].

The iPad is only the first step though. The iPad works in one-on-one sessions, working with one person at a time. Part of the overall strategy needs to be a crowdsourcing effort of meta-data collection. To implement this there needs to be a central point of access where interested parties can have a many-to-one relationship with the content. This community-added meta-data may have to be kept in a separate taxonomy until it can be verified by a curator, but there is no reason to expect this community-added meta-data to be invalid.

Meta-data Collection Model

However, what the app needs to do is more in line with MetaEditor 3.0. MetaEditor actually edits the IPTC tags in the photos, allowing the meta-data to travel with the images. In one sense adding meta-data to an image is annotating an image. But this is something completely different than what Photo Annotate does to images.

Photo Annotate

Photosmith seems to be a move in the right direction, but it is focused on working with Lightroom, not with a social media platform like Gallery2 & Gallery3, Flickr or CopperMine.

While looking at open source photo CMSs, one of the things we have to be aware of is that meta-data needs to come back to the archive in a Dublin Core “markup”. That is, it needs to be mapped and integrated with our current DC aware meta-data schema. So I looked into modules that make Gallery and Drupal “DC aware”. One of the challenges is that there are many photo management modules for Drupal. None of them will do all we want, and some of them will do what we want more elegantly (in a Code is Poetry sense). In Drupal it is possible that several modules together might do what we want. But what is still needed is a theme which elegantly and intuitively pulls together the users, the content, the questions and the answers. No theme will do what we want out of the box. This is where Form, Function, Design and Development all come together, and each case, especially ours, is unique.
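
To illustrate what “mapped to Dublin Core” could mean for the questions on the form, here is a minimal sketch. The element names are real Dublin Core elements, but the choice of which question goes to which element is my assumption for illustration and not our archive's actual crosswalk. Questions without a mapping (such as why the photo was taken) are simply dropped in this sketch.

```python
# Assumed crosswalk from the form's questions to Dublin Core elements.
FORM_TO_DC = {
    "Photo ID Number": "identifier",
    "Who (photographer)": "creator",
    "Who (subject)": "subject",
    "People group": "subject",
    "When (was the photo taken)": "date",
    "Where (Country)": "coverage",
    "Where (City)": "coverage",
    "Where (Place)": "coverage",
    "What is in the Photo": "subject",
    "Photo Description": "description",
    "Who (provided the Meta-data)": "contributor",
}


def to_dublin_core(answers):
    """Group non-empty answers under their Dublin Core element.

    Dublin Core elements are repeatable, so each element holds a list.
    """
    dc = {}
    for question, value in answers.items():
        element = FORM_TO_DC.get(question)
        if element and value:
            dc.setdefault(element, []).append(value)
    return dc
```

Because several questions collapse into one element (three “Where” questions into coverage), the mapping loses detail in this direction; the archive's own record would need to stay the richer copy.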

This crowdsourcing model of meta-data collection has been implemented by the Library of Congress in the Chronicling America project, where the Library of Congress is putting images out on Flickr and the public is annotating (or “enriching” or “tagging”) them. Flickr has something called Machine Tags, which are also used to enrich the content.
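
Machine tags follow a namespace:predicate=value pattern, which makes them easy to separate from ordinary tags. A minimal sketch, assuming a simplified pattern (Flickr's own rules for namespaces and predicates may be stricter) and made-up sample tags:

```python
import re

# Simplified pattern for namespace:predicate=value; an approximation only.
MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def split_tags(tags):
    """Separate plain tags from machine tags.

    Returns (plain, machine) where machine is a list of
    (namespace, predicate, value) tuples.
    """
    plain, machine = [], []
    for tag in tags:
        match = MACHINE_TAG.match(tag)
        if match:
            machine.append(match.groups())
        else:
            plain.append(tag)
    return plain, machine
```

A namespace such as `dc` would let community tags be routed straight to the matching Dublin Core element when they come back to the archive.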

There are two challenges though which still remain:

How do we sync offline iPad enriched photos with online hosted images?
How do we sync the public face of the hosted images to the authoritative source for the images in the archive’s files?
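
I do not have an answer to either challenge, but one common approach to the first is a field-level “most recent edit wins” merge. A minimal sketch, assuming every value carries a timestamp of when it was edited and that records are keyed by photo ID; the function name and data layout are hypothetical:

```python
def merge_records(hosted, offline):
    """Field-level merge keeping the most recently edited value.

    Both arguments map photo_id -> {field: (value, edited_at)}.
    edited_at can be any comparable timestamp; ISO 8601 strings in UTC
    compare correctly as plain text. Neither input is modified.
    """
    merged = {pid: dict(fields) for pid, fields in hosted.items()}
    for pid, fields in offline.items():
        target = merged.setdefault(pid, {})
        for name, (value, edited_at) in fields.items():
            if name not in target or edited_at > target[name][1]:
                target[name] = (value, edited_at)
    return merged
```

This says nothing about which copy is authoritative; in an archive a curator would probably still want to review conflicts rather than let the newest edit win silently.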

All PDFs are not created Equal

Posted on February 18, 2010 by Hugh Paterson III

Many of us use PDFs every day. They’re great documents to work with and read from because of their ease of use and how easy they are to create.

I think I started to use PDFs for the first time in 2004. That’s when I got my first computer. Since that time, most PDFs I have needed to use have just worked. In the time that I have been using PDFs I have noticed that there are at least three major ways in which PDFs are not created equally:

Validity of the PDF: Adherence to the PDF document standard.
Resolution of contained images
The presence and accuracy of the PDF’s meta-data.

Validity

Since 2004, there have only been a few PDFs which after creation and distribution would not render by any of my PDF readers, or on the readers my friends used (most of these PDFs were created by Microsoft Word or Microsoft Publisher on Windows and actually one or two created by Apple’s word processor Pages). Sometimes these errors had to do with a particular image included in the source document. The image may have been malformed, but this was not always the case. Sometimes it was the PDF creator, which was creating non-cross-platform PDFs.

Not all PDFs are created equal. (This is inherently true when one considers the PDF/A and PDF/X standards; however, let’s side-step those standards for a moment. The University of Michigan put a small flyer together on how to get something like a PDF/A to print from MS Word on OS X and Windows. [Link]) To frame this discussion, it is necessary to acknowledge that there is a difference between creating a digital document with a life expectancy of 3 weeks and one with a life expectancy of 150 years. So for some applications, what I am about to say is a moot point. However, looking towards the long term…

If an archival institution wants a document as a PDF, what are the requirements that that PDF needs to have?

What if the source document is not in Unicode? Is the font used in the source document automatically embedded in the PDF upon PDF creation? Consider this from PDFzone.com

Embedding Fonts in a PDF
Another common area of complaint among frequent PDF users is font incompatibility and problems with font embedding. Here are some solutions and tips for putting your best font forward so to speak.

Keep in mind that when it comes to embedding fonts in a PDF file you have to make certain that you have the correct fonts on the system you’re using to make the conversion. Typically you should embed and subset fonts, although there are always exceptions.

If you just need a simple solution that will handle the heavy font work for you, the WonderSoft Virtual PDF Printer helps you choose and embed your fonts into any PDF. The program supports True Type and Unicode fonts.

The left viewing window shows you all the fonts installed on your system and the right viewing window shows the selected user fonts to embed into a newly created PDF form. A single license is $89.95.

Another common solution is the 3-Heights Optimization PDF Optimization Tool [Link Removed].

One of the best sources of information on all things font is at the Adobe site itself under the Developer Resources section.

3-Heights does have an enterprise level PDF validator. I am not sure if there is one in the open source world. But it would seem to me that any archival institution should be concerned with not just having PDFs in their archive but also keenly interested in having valid PDFs in their archives. This is especially true when we consider that one of today’s major security loopholes is malformed file types, i.e. PDFs that are not really PDFs or PDFs with something malicious attached or embedded. (Here is a nice blog post about embedding a DLL in a PDF.) I am sure that there is more than one method to this madness, but it only takes one successful attempt to create a security breach. In fact there are several methods reported, some with JavaScript, some without. Here are a few:

Apparently, several kinds of media can be embedded in PDFs. These include movies and songs, JavaScript, and forms that upload data a user inputs to a web server. And there’s no forgetting the function within the PDF specs to launch executables.
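
A first, very rough screen for these features is to look for the PDF names that introduce them. This is a minimal sketch and not a security tool: the names below are real PDF names, but the list is my own selection, and names hidden inside compressed object streams or written with hex escapes will not be found. A hit is only a reason to look closer.

```python
# PDF names associated with active or embedded content (a partial, assumed list).
RISKY_NAMES = (
    b"/JavaScript",
    b"/Launch",
    b"/EmbeddedFile",
    b"/RichMedia",
    b"/SubmitForm",
    b"/OpenAction",
)


def risky_features(pdf_bytes):
    """Return the risky names that appear literally in the raw bytes of a PDF."""
    return [name.decode("ascii") for name in RISKY_NAMES if name in pdf_bytes]
```

For a file on disk, `risky_features(open(path, "rb").read())` would give the list for that file.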

Printing the PDF does not seem to be a fail proof method to see if the PDF is valid or even usable. See this write up from The University of Sydney School of Mathematics and Statistics:

Problem
I can read the file OK on screen, but it doesn’t print properly. Some characters are missing (often minus signs) and sometimes incorrect characters appear. What can I do?
Solution
When the Acrobat printing dialog box opens, put a tick in the box alongside “print as image”. This will make the printing a lot slower, but should solve the problem. (The “missing minus signs” problem seemed to occur for certain – by now rather old – HP printers.)
(Most of these problems with pdf files are caused by subtle errors with the fonts the pdf file uses. Unfortunately, there are only a limited number of fonts that supply the characters needed for files that involve a lot of mathematics.)

Printing a PDF is not necessarily a fail proof way to see if a PDF is usable. Even if the PDF is usable, printing the PDF does not ensure that it is a valid PDF either. When considering printing as a fail proof method one should also consider that PDFs can contain video, audio, and Flash content. So how is one to print this material? Or, in an archival context, determine that the PDF is truly usable? A valid PDF will render and archive correctly because it conforms to the PDF standard (whatever version of that standard is declared in the PDF). Having a PDF conform to a given PDF standard puts the onus on the creator of the PDF viewer (software) to implement the PDF standard correctly, thus making the PDF usable (as intended by the PDF’s author).
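
The declared version can at least be read without any special software, because a PDF starts with a `%PDF-x.y` header and ends with an `%%EOF` marker. A minimal sketch of that check follows; it is a sanity check under my own assumptions (looking only at the first and last 1024 bytes), not validation against the standard, and a later version declared in the document catalog would be missed.

```python
import re


def declared_pdf_version(pdf_bytes):
    """Return the version from a '%PDF-x.y' header in the first 1024 bytes, else None."""
    match = re.search(rb"%PDF-(\d\.\d)", pdf_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


def looks_like_pdf(pdf_bytes):
    """Cheap sanity check: a header is present and '%%EOF' appears near the end."""
    return (
        declared_pdf_version(pdf_bytes) is not None
        and b"%%EOF" in pdf_bytes[-1024:]
    )
```

A file that fails this check is certainly worth a second look on acceptance to an archive; a file that passes still needs a real validator.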

Note: I have not checked the Digital Curation Center for any recommendations on PDFs and ensuring their validity on acceptance to an archive.

Resolution of Contained Images

A second way that PDF documents can vary is that the resolutions of the images contained in them can vary considerably. The images inside of a PDF can be in a variety of image formats: .jpg, .tiff, .png, etc. So the types of compression and the lossiness of these compressions can make a difference in the archival “quality” of a PDF. A similar difference is noted between a raster PDF and a vector PDF. ^[1] (Yishai. 1 July 2009. All PDF’s are not created equal. Part III (out of III). Digilabs Technologies Blog. …) Beside these two types of differences there are various PDF printers, which print materials to PDF in various formats. This manual discusses setting Adobe Acrobat Pro’s included PDF printer.

Meta-data

A third way in which PDFs are not created equally is that they do not all contain valid and accurate meta-data in the meta-data containers available in the PDF standard. Not all PDF generators add meta-data to the right places in a PDF file, and those which do add meta-data do not always add the correct meta-data.

Prepressure.com presents some clear discussion on the embedded properties of meta-data in PDFs. Prepressure.com has a really helpful section on PDFs and various issues pertaining to PDFs and their use: http://www.prepressure.com/pdf/basics.
Their discussion on meta-data can be found at http://www.prepressure.com/pdf/basics/metadata.

How metadata are stored in PDF files

There are several mechanisms available within PDF files to add metadata:

The Info Dictionary has been included in PDF since version 1.0. It contains a set of document info entries, simple pairs of data that consist of a key and a matching value. Some of these are predefined, such as Title, Author, Subject, Keywords, Created (the creation date), Modified (the latest modification date) and Application (the originating application or library). Applications can add their own sets of data to the info dictionary.

XMP (Extensible Metadata Platform) is an Adobe technology for embedding metadata into files. It can be used with a wide variety of data files. With Acrobat 5 and PDF 1.4 (2001) this mechanism was also made available for PDF files. XMP is more powerful than the info dictionary, which is why it is used in a number of PDF-based metadata standards.

Additional ways of embedding metadata are the PieceInfo Dictionary (used by Illustrator and Photoshop for application specific data when you save a file as a PDF), Object Data (or User Properties) and Measurement Properties.

PDF metadata standards

There are a number of interesting standards for enriching PDF files with metadata. Below is a short summary:

There are PDF substandards such as PDF/X and PDF/A that require the use of specific metadata. In a PDF/X-1a file, for example, there has to be a metadata field that describes whether the PDF file has been trapped or not.

The GWG ad ticket provides a standardized way to include advertisement metadata into a PDF file.

Certified PDF is a proprietary mechanism for embedding metadata about preflighting – whether a PDF file intended to be printed by a commercial printer or newspaper has been properly checked on the presence of all fonts, images with a sufficient resolution,…

The filename is metadata as well

The easiest way to add information about a PDF to the file is by giving it a proper filename. A name like ‘SmartGuide_12_p057-096_v3.pdf’ tells a recipient much more about what the file is about than ‘pages_part2_nextupdate.pdf’ does.

Add the name of the publication and possibly the edition to the filename.

Add a revision number (e.g. ‘v3′) if there will be multiple updates of a file.

If a file contains part of the pages of a publication add at least the initial folio to the filename. That allows people to easily sort files in the right order. Use 2 or 3 digits for the page number (e.g. ‘009′ instead of just ‘9′).

Do not use characters that are not supported in other operating systems or that have a special meaning in some applications: * < > [ ] = + ” \ / , . : ; ? % # $ | & •.

Do not use a space as the first or last character of the filename.

Don’t make the filename too long. Once you go beyond 50 characters or so people may not notice the full information or the filename may get clipped in browser windows or applications.

Many prepress workflow systems can automatically insert files into a job based on a specific naming convention. This speeds up the processing of the job and can avoid costly mistakes. Consult with your printer – they may have guidelines for submitting files.
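
The filename guidelines quoted above are mechanical enough to check automatically. A minimal sketch, assuming that the forbidden characters apply to the part of the name before the extension (otherwise the dot before “pdf” would itself be a violation) and treating the “50 characters or so” advice as a hard limit; the function name is hypothetical.

```python
# Characters from the guideline above; both straight and curly double quotes included.
FORBIDDEN = set('*<>[]=+"\u201d\\/,.:;?%#$|&\u2022')
MAX_LENGTH = 50  # the guideline says "50 characters or so"


def filename_problems(filename):
    """Return a list of guideline violations for a filename (empty list if none)."""
    stem, dot, _ext = filename.rpartition(".")
    if not dot:  # no extension at all
        stem = filename
    problems = []
    bad = sorted({ch for ch in stem if ch in FORBIDDEN})
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    if filename != filename.strip(" "):
        problems.append("starts or ends with a space")
    if len(filename) > MAX_LENGTH:
        problems.append("longer than %d characters" % MAX_LENGTH)
    return problems
```

The good example from the quote, ‘SmartGuide_12_p057-096_v3.pdf’, passes with no problems reported.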

Even on my favorite operating system, OS X there are several methods available to users for making PDFs of documents. These methods do not all create the same PDFs. (The difference is in the meta-data contained and in the size of the files.) This is pointed out by Rob Griffiths ^[2]Rob Griffiths. Keep some PDF info private. Macworld.com. Mar 1, 2007 2:00 am. <Accessed 14 March 2011>. [Link] in Macworld in an article on privacy, and being aware of PDF meta-data which might transmit more personal information than the document creator might desire. However, what Rob points out is that there are several methods of producing PDFs on OS X and these various methods include or exclude various meta-data details. Just as privacy concerns might motivate the removal of embedded meta-data (or perhaps the creation of PDF without meta-data), the accuracy of archive quality should drive the inclusion of meta-data in PDF files hosted by archives. There are two obvious ways to increase the quality of a PDF in an archive:

The individual can enrich the PDF with meta-data prior to submission (risking that the institution will strip the meta-data embedded and input their own meta-data values)
The archive can systemically enrich the meta-data based on the other meta-data collected on the file while it is in their “custody”.

As individuals we can take responsibility for the first point. There are several open source tools for inspecting or editing the embedded meta-data; one of these is pdfinfo. (Another command line tool is ExifTool (Link to Wikipedia). ExifTool is more versatile, working with more file types than just PDF, but again this tool does not have a GUI.) I wish I could find a place to download pdfinfo on its own, but it only seems to be in Linux software repositories. However, there are several other command line packages which incorporate this utility. One of these packages is xpdf. Xpdf is available under the GPL for personal use from foolabs. The code has to be compiled from source, but there are links to several other websites with compiled versions for various OSes. There is an OS package installer available from phg-online.de. For those of us who are strong believers in GUIs and loathe the TUI (Text User Interface, or command line) there is a freely available GUI for pdfinfo from sybrex.com.
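
For checking many PDFs at once, the output of pdfinfo can be read from a script. A minimal sketch in Python, assuming pdfinfo is installed and on the PATH and that it prints one “Key: value” pair per line; the function names are my own, and the sample values in any test of this code are made up.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdfinfo(path):
    """Run pdfinfo on one file and return its output as a dictionary.

    Raises subprocess.CalledProcessError if pdfinfo reports an error.
    """
    result = subprocess.run(
        ["pdfinfo", str(path)], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

With this, comparing which PDFs have an empty Title or Author becomes a loop over a folder rather than a series of clicks.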

Because I use PDFs extensively in matters of linguistic research, I thought that I would take a look at several PDFs from a variety of sources. These include:

JSTOR: Steele (1976) ^[3]. JSTOR is a well known archive in academic circles (especially the humanities).
Project Muse: (Language) Ladefoged (2007) ^[4]. Project Muse is another well known repository for the humanities. Language is a well respected journal in the linguistic sciences, published by the Linguistic Society of America.
Cambridge Journals: (Journal of the International Phonetic Association) Olson, Mielke, Sanicas-Daguman, Pebley and Paterson (2010) ^[5]. Cambridge University Press, of which Cambridge Journals is a part, is a major publisher of linguistic content in the English academic community.
SIL Academic Publishing: Gardner and Merrifield (1990) ^[6]. This PDF is found through the SIL Bibliography, but was prepared by the Academic Publishing department of SIL. It is important to note that this work was made available through SIL’s Global Publishing Service (formerly Academic Publishing), not through the Language and Culture Archives. This is evidenced by the acpub used in the URL for accessing the actual PDF: www.sil.org/acpub/repository/24341.pdf. As a publishing service, this particular business unit of SIL is more apt to be aware of and use higher PDF standards like PDF/A in their workflows.
SIL – Papua New Guinea: Barker and Lee (n.d.) ^[7]. The manuscript is undated, but it was made available online in 2009 by SIL – Papua New Guinea.
SIL Mexico Branch: Benito Apolinar Antonio, et al. MWP#9a ^[8] and Benito Apolinar Antonio, et al. MWP#9b ^[9]. It is interesting to note that the production tool used to create the PDFs for the Mexico Branch Work Papers was XLingPaper ^[10] ^[11]. XLingPaper is a plugin for XMLMind, an XML editor. It is used for creating multiple products from a single XML data source. (In this case the data source is the linguistics paper.) Advanced authoring tools like XLingPaper, and LaTeX and its flavors like XeTeX, should be able to handle the assignment of keywords and meta-data on the creation of the PDF.
Example of a PDF made from Microsoft Word: Snider (2011) ^[12].
Example of a PDF made from Apple Pages: Paterson and Olson (2009) ^[13].

The goal of the comparison is to look at a variety of PDF resources from a variety of locations and content handlers. I have included two linguistic journals, two repositories for journals, and several items from various SIL outlets. Additionally, I have included two different PDFs which were authored with popular word processing applications. To view the PDFs and their meta-data I used Preview, a PDF and media viewer which is created by Apple and ships with OS X. Naturally, the scope of the available meta-data to be viewed is limited to what Preview is programmed to display. Adobe Acrobat Pro will display more meta-data fields in its meta-data editing interface.

JSTOR:

Using Preview in OS X to look at the embedded meta-data in a PDF from JSTOR.
Project Muse:

Using Preview on OS X to look at the embedded meta-data of a PDF from Project Muse and the journal Language
Cambridge Journals:

Using Preview on OS X to look at the embedded meta-data of a PDF from Cambridge Journals and the Journal of the International Phonetic Association
SIL Academic Publishing (not the archive):

Using Preview on OS X to look at the embedded meta-data of a PDF as it was prepared by Academic Publishing

A close up view of the Keywords meta-data as it was prepared by Academic Publishing.
Among the PDFs surveyed, Academic Publishing was the only producer to use Keywords. They were also the only one to embed the ISO 639-3 code of the subject language of the item.
SIL – Papua New Guinea:

Using Preview on OS X to look at the embedded meta-data of a PDF prepared by SIL - Papua New Guinea
SIL Mexico Branch:
Work Papers #9a

Using Preview on OS X to look at the embedded meta-data of a PDF prepared by SIL - Mexico Branch

Work Papers #9b

Using Preview on OS X to look at the embedded meta-data of a PDF prepared by SIL - Mexico Branch
MS Word Example:

Using Preview on OS X to look at the embedded meta-data of a PDF prepared by an individual using MS Word
Notice that the application used to create the PDF inserts “Microsoft Word – ” before the document title.
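
This kind of prefix is simple to clean up before a title is reused in a catalog or a citation manager. Here is a small sketch in Python; the function name clean_title is my own, and it assumes that the prefix always appears at the very start of the title and is followed by either a hyphen or an en dash.

```python
import re

# "Microsoft Word", optional spaces, a hyphen or an en dash, optional spaces.
_WORD_PREFIX = re.compile(r"^Microsoft Word\s*[-\u2013]\s*")


def clean_title(title: str) -> str:
    """Remove a leading 'Microsoft Word - ' from an embedded PDF title."""
    return _WORD_PREFIX.sub("", title).strip()
```

Titles without the prefix pass through unchanged, so the function can be run over a whole collection without checking first which application produced each PDF.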
Apple Pages Example:

Pages Document Inspector showing where one can edit the meta-data which will be passed to the PDF when created using the Export option.

As we can see from the images presented here, there is no widespread adoption of a systematic process to encourage the end user to produce enriched PDFs:

on the part of publishers,
or on the part of developers of writing utilities like MS Word or XLingPaper.
Additionally, there is no systematic process used by content providers to enrich content produced by publishers.

However, enriched content (PDFs) is used by a variety of PDF management applications and citation management software. That is, consumers do benefit from the enriched state of PDFs, and consumers are looking for these features. (The discussion on Yep 2’s forums highlights this point. Yep 2 is a consumer desktop media and PDF management tool. There are several other tools out there, like Papers2, Mendeley, Zotero and even EndNote.)

If I were to extend this research I would look at PDFs from more content providers. I would look for a PDF from an Open Access repository like the Rutgers Optimality Archive, a dissertation from ProQuest, some content from a reputable archive like PARADISEC, and something from a DSpace implementation. (Xpdf can be used in conjunction with DSpace; in fact it is even mentioned in the manual.)

References
[1] Yishai. 1 July 2009. All PDF’s are not created equal. Part III (out of III). Digilabs Technologies Blog. http://digilabsblog.wordpress.com/2009/07/01/all-pdf’s-are-not-created-equal-part-iii-out-of-iii/. [Link] [Accessed: 23 January 2012]
[2] Rob Griffiths. Keep some PDF info private. Macworld.com. Mar 1, 2007 2:00 am. [Link] [Accessed: 14 March 2011]
[3] Susan M. Steele. 1976. A Law of Order: Word Order Change in Classical Aztec. International Journal of American Linguistics, vol 42 (1): 31-45. [Link]
[4] Peter Ladefoged. 2007. Articulatory Features for Describing Lexical Distinctions. Language 83.1: 161-80.
[5] Kenneth S. Olson, Jeff Mielke, Josephine Sanicas-Daguman, Carol Jean Pebley & Hugh J. Paterson III. 2010. The phonetic status of the (inter)dental approximant. Journal of the International Phonetic Association 40.02: 199-215. [Link]
[6] Richard Gardner and William R. Merrifield. 1990. Quiotepec Chinantec tone. In William R. Merrifield and Calvin R. Rensch (eds.), Syllables, tone, and verb paradigms: Studies in Chinantec languages 4, 91-105. Summer Institute of Linguistics and the University of Texas at Arlington Publications in Linguistics, 95. Dallas: Summer Institute of Linguistics and the University of Texas at Arlington. [PDF]
[7] Fay Barker and Janet Lee. Available: 2009; Created: n.d. A tentative phonemic statement of Waskia. [Manuscript] 40 p. [Link]
[8] Benito Apolinar Antonio, et al. 2010. Vocabulario básico en me’phaa. SIL-Mexico Electronic Working Papers #9a. [PDF]
[9] Benito Apolinar Antonio, et al. 2010. Vocabulario básico en me’phaa. SIL-Mexico Electronic Working Papers #9b. SIL International. [PDF]
[10] H. Andrew Black. 2009. Writing linguistic papers in the third wave. SIL Forum for Language Fieldwork 2009-004: 11. http://www.sil.org/silepubs/abstract.asp?id=52286. [PDF]
[11] H. Andrew Black and Gary F. Simons. 2009. Third wave writing and publishing. SIL Forum for Language Fieldwork 2009-005: 15. http://www.sil.org/silepubs/abstract.asp?id=52287. [PDF]
[12] Keith Snider. 2011. On Discovering Contrastive Tone Melodies. Paper presented at the Berkeley Tone Workshop, 18-20 February 2011, University of California, Berkeley.
[13] Hugh Paterson III and Kenneth Olson. 2009. An unlikely retention. Paper presented at the 11th International Conference on Austronesian Linguistics, 22–26 June 2009, Aussois, France.

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)
