
The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)


Category Archives: Meta-data


A Minimal Set of meta-data to strive to collect for each Photo

Posted on July 10, 2011 by Hugh Paterson III

A Minimal Set of meta-data to strive to collect:

Photo ID Number: ______________________________

Collection:____________________________________

Sub-Collection:_________________________________

Who (photographer): ____________________________

Who (subject): _________________________________

Who (subject): _________________________________

Who (subject): ________________________________

Who (subject): ________________________________

Who (subject): ________________________________

People group:_________________________________

When (was the photo taken): _______________________

Where (Country): _______________________________

Where (City): _________________________________

Where (Place): ________________________________

What is in the Photo: ____________________________

What is in the Photo: ____________________________

What is in the Photo: ____________________________

Why was it taken (Event):_________________________

Description:____________________________________

Who (provider): ________________________________

Who (provider): ________________________________

Who (provider): ________________________________

Who (provider): _______________________________
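For digital capture, the same form can be expressed as a structured record. Below is a minimal sketch in Python; the field names are my own rendering of the form above, not an existing schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PhotoRecord:
    """One photo's minimal meta-data, mirroring the form above."""
    photo_id: str
    collection: Optional[str] = None
    sub_collection: Optional[str] = None
    photographer: Optional[str] = None
    subjects: List[str] = field(default_factory=list)   # Who (subject); repeatable
    people_group: Optional[str] = None
    date_taken: Optional[str] = None                    # When (was the photo taken)
    country: Optional[str] = None
    city: Optional[str] = None
    place: Optional[str] = None
    contents: List[str] = field(default_factory=list)   # What is in the photo; repeatable
    event: Optional[str] = None                         # Why was it taken
    description: Optional[str] = None
    providers: List[str] = field(default_factory=list)  # Who provided the meta-data
```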

Posted in Meta-data | Tagged metadata, Photo

Social Meta-data collection

Posted on June 29, 2011 by Hugh Paterson III

As part of my job I work with materials created by the company I work for, that is, the archived materials. We have several collections of photos by people from around the world. In fact we might have as many as 40,000 photos, slides, and negatives. Unfortunately most of these images have no meta-data associated with them. It just happens to be the case that many of the retirees from our company still live around or volunteer in the offices. Much of the meta-data for these images lives in the minds of these retirees. Each image tells a story. As an archivist I want to be able to tell that story to many people. But I do not know what that story is. I need to be able to sit down, listen to that story, and make notes on each photo. This is time consuming. More time consuming than the time I have.

Here is the Data I need to minimally collect:

Photo ID Number: ______________________________
Who (photographer): ____________________________
Who (subject): ________________________________
People group:_________________________________
When (was the photo taken): _______________________
Where (Country): _______________________________
Where (City): _________________________________
Where (Place): ________________________________
What is in the Photo: ____________________________
Why was the photo taken (At what event):_________________________
Photo Description:__short story or caption___
Who (provided the Meta-data): _________________________

Here is my idea: have two volunteers with iPads sit down with the retirees, show them these pictures on the iPads, and start collecting the data. The iPad app needs to display the photos and then allow the user to answer the above questions quickly and easily.

One app which has a really nice UI for editing photos is PhotoForge. [Review].

The iPad is only the first step though. The iPad works in one-on-one sessions, with one person at a time. Part of the overall strategy needs to be a crowdsourcing effort for meta-data collection. To implement this there needs to be a central point of access where interested parties can have a many-to-one relationship with the content. This community-added meta-data may have to be kept in a separate taxonomy until it can be verified by a curator, but there is no reason to assume that community-added meta-data cannot be valid.

Meta-data Collection Model

However, what the app needs to do is more in line with MetaEditor 3.0. MetaEditor actually edits the IPTC tags in the photos, allowing the meta-data to travel with the images. In one sense, adding meta-data to an image is annotating an image. But this is something completely different from what Photo Annotate does to images.

Photo Annotate

Photosmith seems to be a move in the right direction, but it is focused on working with Lightroom, not with a social media platform like Gallery2 & Gallery3, Flickr, or Coppermine. While looking at open source photo CMSs, one of the things we have to be aware of is that meta-data needs to come back to the archive in a Dublin Core “markup”. That is, it needs to be mapped and integrated with our current DC-aware meta-data schema (a mapping sketch follows the list below). So I looked into modules that make Gallery and Drupal “DC aware”. One of the challenges is that there are many photo management modules for Drupal. None of them will do all we want, and some of them will do what we want more elegantly (in a Code is Poetry sense). In Drupal it is possible that several modules together might do what we want. But what is still needed is a theme which elegantly and intuitively pulls together the users, the content, the questions and the answers. No theme will do what we want out of the box. This is where Form, Function, Design and Development all come together – and each case, especially ours, is unique.

  1. Adding Dublin Core Metadata to Drupal
  2. Dublin Core to Gallery2 Image Mapping
  3. Galleries in Drupal
  4. A Potential Gallery module for drupal – Node Gallery
  5. Embedding Gallery 3 into Drupal
  6. Embedding Gallery 2 into Drupal
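As a concrete illustration of the mapping problem, here is a minimal sketch in Python of how the questionnaire fields might be mapped onto Dublin Core elements before hand-off to a DC-aware system. The field names and the particular DC element choices are my own assumptions for illustration, not a worked-out crosswalk:

```python
# Hypothetical crosswalk from the photo questionnaire to Dublin Core elements.
DC_CROSSWALK = {
    "photographer": "dc:creator",
    "subjects":     "dc:subject",      # the people in the photo
    "date_taken":   "dc:date",
    "country":      "dc:coverage",     # spatial coverage
    "city":         "dc:coverage",
    "place":        "dc:coverage",
    "event":        "dc:description",  # 'why it was taken' folded into the description
    "description":  "dc:description",
    "providers":    "dc:contributor",  # the retirees supplying the meta-data
}

def to_dublin_core(record: dict) -> list:
    """Flatten a questionnaire record into (element, value) pairs."""
    pairs = []
    for field_name, dc_element in DC_CROSSWALK.items():
        value = record.get(field_name)
        if not value:
            continue
        values = value if isinstance(value, list) else [value]
        pairs.extend((dc_element, v) for v in values)
    return pairs
```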

This crowdsourcing model of meta-data collection has been implemented by the Library of Congress in the Chronicling America project, where the Library of Congress puts images out on Flickr and the public annotates (or “enriches” or “tags”) them. Flickr also has something called Machine Tags, which are likewise used to enrich the content.

There are two challenges, though, which still remain:

  1. How do we sync offline iPad enriched photos with online hosted images?
  2. How do we sync the public face of the hosted images to the authoritative source for the images in the archive’s files?
Posted in CMS, Digital Archival, Drupal, Geo-Tagging, Meta-data | Tagged archival, Dublin Core, Drupal, Gallery2, Gallery3, Meta-Data, Social Media

Network Language Documentation File Management

Posted on May 4, 2011 by Hugh Paterson III

This post is an open draft! It might be updated at any time…

Meta-data is not just for Archives

Bringing the usefulness of meta-data to the language project workflow
It has recently come to my attention that there is a challenge when considering the need for a network accessible file management solution during a language documentation project. This comes with my first introduction to linguistic field experience and my first field setting for a language documentation project.

The project I was involved with was documenting 4 languages in the same language family. The location was in Mexico. We had high-speed Internet and a Local Area Network. Stable electricity (more often than not). The heart of the language communities was a 2-3 hour drive from where we were staying, so we could make trips to different villages in the language community, and there were language consultants coming to us from various villages. Those consultants who came to us were computer literate and were capable of writing in their language. The methods of the documentation project were motivated along the lines of: “we want to know ‘xyz’ so we can write a paper about ‘xyz’, so let’s elicit things about ‘xyz’”. In a sense, the project was product oriented rather than (anthropological) framework oriented.

We had a recording booth. Our consultants could log into a Google Doc and fill out a paradigm; we could run the list of words given to us through the Google Doc to a word processor, create a list to be recorded, give that list to the recording technician, and then produce a recorded list. Our consultants could also create a story, and often did, and then we would help them to revise it and record it. We had geo-social data from the Mexican government census. We had geo-spatial data from our own GPS units. During the course of the project massive amounts of data were created in a wide variety of formats.

Additionally, in the case of this project language description is happening concurrently with language documentation. The result is that additional data is desired and generated. That is, language documentation and language description feed each other in a symbiotic relationship. Description helps us understand why this language is so important to document and which data to get; documenting it gives us the data for doing analysis to describe the language. The challenge has been: how do we organize the data in meaningful and useful ways for current work and future work (archiving)? People are evidently doing it, all over the world… maybe I just need to know how they are doing it. In our project there were two opposing needs for the data:

  • Data organization for archiving.
  • Data organization for current use in analysis and evaluation of what else to document. (It could be argued that a well planned corpus would eliminate, or reduce, the need for flexibility to decide what else there is to document. This line of thought does have its merits. But flexibility is needed by those people who do not try to implement detailed plans.)


Posted in CMS, Digital Archival, Geo-Tagging, Language Documentation, Linguistics, Meta-data, Opensource, Web Server | Tagged CMS, Drupal, file management, Language Documentation, Linguistics, opendraft, UI, UX

Meꞌphaa Bibliography

Posted on February 22, 2011 by Hugh Paterson III
This is an experimental use of Mendeley's API to present a bibliography of materials used in the Meꞌphaa language documentation project. There are several limitations to the WordPress plugin because it does not bring over all the reference types. This is partly a limitation of Mendeley's API and partly a limitation of the reference types they support in their application. The WordPress plugin ignores that some references do not have the same parts in their citations. Some form of CSL should be used in the plugin. More about Citation Style Language. One other thing that I have noticed is that when there is a URL which ends in .pdf, the plugin re-codes the link name to "pdf". This is the advertised behavior. However, when there is more than one URL, they all say "url" rather than the last part of the URI. Look at this example from above:
Steven Egland, Doris Bartholomew, Saúl Cruz Ramos (1978) La inteligibilidad interdialectal de las lenguas indígenas de México: Resultado de algunos sondeos, Instituto Lingüístico de Verano, p. 58-59, Mexico City: Instituto Lingüístico de Verano, url, url
Languages of Western Central Mexico

[mendeley type="groups" id="899061" groupby="year" grouporder="desc"]
Posted in Blogging, Citations, CMS, Language Documentation, Linguistics, WordPress | Tagged Bibliography, citation, Language Documentation, Me'phaa, Tlapanec, Tlapaneco, wordpress, [sei], [tcf], [TLL], [tpc], [tpl], [tpx]

Using Endnote X4 for Mac

Posted on February 10, 2011 by Hugh Paterson III

One of the most popular citation management applications among academics is Endnote. Endnote has a long history, is published by a reputable company, and has some pretty cool features. I use it (version X4) primarily because it is the only citation software which integrates natively with the word processor Pages, by Apple, Inc. (There is other citation management software for OS X which claims integration with Pages, but none of these solutions are endorsed or supported by Apple.) Some of the other applications which claim integration with Pages are:

  1. Sente
  2. Bookends
  3. Papers – This is according to Wikipedia, but I own and use Papers 1.9.7 and have not seen how to integrate it with Pages. (However, Papers2, released March 8th, 2011 does say that it supports citation integration with Pages.)

Endnote boasts a bit of flexibility and quite a few useful features. Some of the really useful features I use are below.

  • Customizing the output style of the bibliographies. There are several Linguistics Journals with style sheets on Endnote’s Website. Among them are:
    • Linguistic Inquiry
    • Journal of the International Phonetic Association
    • Phonology
    • Lingua
    • Journal of Phonetics
    • Language: The Journal of the Linguistic Society of America
    • Phonetica

    Additionally there is a version of the Unified Linguistics Style Sheet available for Endnote. This is available from Manchester, UK. http://www.llc.manchester.ac.uk/intranet/ug/useful-links/computing-resources/wordprocessing/. [.ens file]

  • Looking for PDF files.
  • Attaching additional meta-data to each citation. (Like ISO 639-3 Language Codes)
  • Adding additional types of resources like Rutgers Optimality Archive Documents with an ROA number.
  • Smart Groups of files based on desired criteria.
  • Integration with Apple’s word processor Pages.
  • Research Notes section in the citation’s file for creating an annotated bibliography.
  • Copy out all the selected works, so that they can be pasted as a bibliography in another document.
  • XML output of citation data. (The XML support of Endnote has not been hailed as the greatest implementation of XML, but there are tools out there to work with it; a parsing sketch follows this list.)
    • http://www.uns.ethz.ch/pub/publications/xml_format
    • XSL Transformation stylesheet for EndNote XML.
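For example, here is a minimal sketch in Python that pulls authors and titles out of an EndNote XML export. It assumes the usual <records>/<record> layout of EndNote's XML, with text wrapped in <style> children; treat the element paths as assumptions to verify against your own export:

```python
import xml.etree.ElementTree as ET

def _text(elem):
    """Concatenate all text inside an element (EndNote wraps text in <style> children)."""
    return "".join(elem.itertext()).strip() if elem is not None else ""

def list_references(path):
    """Print 'authors - title' for each record in an EndNote XML export."""
    tree = ET.parse(path)
    for record in tree.getroot().iter("record"):
        title = _text(record.find(".//titles/title")) or "(no title)"
        authors = [_text(a) for a in record.findall(".//contributors/authors/author")]
        print("; ".join(a for a in authors if a) or "(no author)", "-", title)

list_references("MyEndnoteLibrary.xml")
```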

However, regardless of how many good features I find and use in Endnote there are several things about it which irk me to no end. This is sort of a laundry list of these problematic areas.

  1. Cannot sort by resource type:
    For instance, if I wanted to sort or create a smart list of all my Book references, or just Journal Articles. This can be done; one just has to create a smart list and then set Reference Type to Contains: “Book Section”. But there is no drop-down list of reference types presented to the user.
  2. Cannot sort by custom field:
    I think you can do this in the interface, though it was not obvious how to do it.
    Endnote Display Fields

  3. Cannot view all the custom fields for a resource type across all resources.
    The sorting viewer seems to be limited to eight fields at a time.
  4. Cannot view all entries without content in a specified field.
    It would be especially nice to be able to create a smart list for this.
  5. No export of PDFs, or export of PDFs with .ris files.
  6. There is no keyboard shortcut to bring up the Import function (or Export) under the File Menu.
  7. Does not rename PDFs based on the metadata of the resource.
    This is possible with Papers and Mendeley. The user has the option to rename the file based on things like Author, Date of publication, etc.
  8. Cannot create a smart list based on a constant in the Issue data part.
    I have Volume and Issue data. Some of the citation data pulled in for some items has the issue set as 02, 03, etc. I want to be able to find all the issues which start with a zero so I can remove the zeros. Most stylesheets do not remove the zeros and also do not allow for them.
    Items are not selectable based on Issue Data

  9. Cannot export PDFs with embedded metadata in the PDF.
  10. Cannot open the folder which contains a PDF included in an Endnote Library.
    Endnote does not have a way to open the containing folder of an included PDF

    Close-up of the same menu.

  11. Modifying a Resource type does not accept |Language| Subject Language|.
  12. There is no guide in any of Endnote’s documentation for how to create an export style sheet.
    It is only in the Help Menus; I was expecting it on the producer’s website or in a book.
  13. When editing an entry’s meta-data, i.e. the author or the title of a work, pressing TAB does not move the cursor to the next field.
    At least sometimes it does not continue to TAB. If I do a new entry as a Journal Article, then it will tab until the Issue field, but not beyond. It gets stuck.
  14. There is no LAN collaboration or sharing feature for a local network solution.
  15. There is no Cloud based collaborative solution.
  16. There is no way to create a smart group based off of a subset of items in a normal group.
    i.e. I want to create a smart group of all the references with a PDF attached but I only want it to pull from the items in a particular group (or set of groups).
    Endnote has no option for a 'part of group' filter

  17. There is no PDF preview within the application. The existing Preview is for seeing the current citation in the selected citation style (a preview of the output). It would be helpful if there were also a preview pane for viewing the PDF or the attached file.
    No PDF Preview internal to Endnote

Posted in Library, Linguistics, Meta-data | Tagged Bibliography, citations, Endnote, Linguistics, OSX, PDF Management, Writing, X4

Notes and a Bibliography

Posted on February 8, 2011 by Hugh Paterson III

I have been looking for a way to create posts with both a Footnotes section and a Bibliography section. I have wanted to make my posts a little more professional looking, and let the information flow more easily with the way I write. What I have come to realize is that Footnotes and Endnotes are different and function differently with respect to information processing. Traditionally, in print media Endnotes have occurred at the end of the article, whereas Footnotes have occurred at the bottom of the page on which the footnote is mentioned. This leads to a three-way breakdown:

  1. Footnotes
  2. Endnotes
  3. Bibliography

The purpose of footnotes is to facilitate quick information processing without breaking the flow of reading or information processing for the consumer of the information. In web-based media, the end of the article and the end of the page are the same if pagination is not enabled. So this creates a sort of syncretism between Endnotes and Footnotes. However, the greater principle of quick reference to additional information still applies on the web. There are several strategies which have tried to fill this information processing niche, including things like:

  • Tooltips (The pop-up text which appears when your mouse cursor hovers over a link or some other text.)
  • Lightbox (The darker shading of the background and the high-lighting of the content in focus.)
  • Pop-up windows (which have been phased out of popular "good web design").
  • Information (Text) balloons. (An example of this is Wikipop. Wikipop is really a combination of the above-mentioned effects, used to create an inline experience for the user. But some websites have a similar effect which is dependent on the mouse hovering over the "trigger".)

With strategies for conveying information like Tooltips it is possible to meet the same information communication and information processing goals which were formerly achieved through footnotes. For web-based information, which is intended to be consumed through a web medium, Wikipop makes a lot of sense. However, if the goal is a good print-out of the content then footnotes are still needed; that is why I am using footnotes on this particular web presentation. (A solution which does both, tooltips or solutions like Wikipop on screen and footnotes when the content is printed, would be ideal.)

So here is a quick post on how I am doing it.
I am using two different "endnotes" plugins. One for the Bibliography section and the other for the Notes section.

Creating the Footnotes section:
To create the notes section I have elected to use a plugin called Footnotes by Rob Miller (big surprise on the name of the plugin...), even though there are other options for footnote plugins; one other option I know about is FD-Footnotes. Footnotes allows me to put what I want to show up as footnotes in <ref>something</ref> tags. (In order to get these tags to display inside of <code> and </code> tags I had to use HTML codes for the greater-than sign, less-than sign and slash. There is some additional good information about character encoding in HTML on Wikipedia.)
Additionally I can set a tag <reference /> anywhere in the post and produce a list of footnotes.

Creating the Bibliography:
To create the Bibliography section I am using WP-Footnotes (in the WP plugins repository) by Simon Elvery. More information can be read about this plugin here. What this plugin allows me to do is to craft the citation of the item I want to cite. I have to figure out how I want to "code" the citation and then present the citation.

[1] Hand code the contents of the citation, as it is to appear in the bibliography, here between a set of double parentheses. Do not forget a space between the citation text and the double closing parentheses.

This will produce a citation marker (a number) as a superscript inline with the text, like this [2] Nikolaus P. Himmelmann. 1998. Documentary and Descriptive Linguistics. Linguistics vol. 36:161-195. [PDF] [Accessed 24 Dec. 2010] :
And that will produce a citation in the bibliography section like the following:

Nikolaus P. Himmelmann. 1998. Documentary and Descriptive Linguistics. Linguistics vol. 36:161-195. [PDF] [Accessed 24 Dec. 2010]

One interesting thing that occurs on the admin side of WordPress is that the plugin WP-Footnotes has an options page which shows up in the Settings menu; what is interesting is that in the menu it is called Footnotes, not WP-Footnotes.
The options for WP-Footnotes really make it flexible; it is these settings which have allowed me to rename the section from Notes to Bibliography.

WP-Footnote Options

Final solution?
Is this my final solution? No. One thing I really don't like is that the bibliography is not ordered alphabetically by last name and then by year of publication. Rather, citations are ordered in the order of appearance (as footnotes generally are). The plugin does not have any options for changing the order in which things appear (though the heading on the ordered list can be changed). There is also no way to structure the data in the bibliography for reuse (even if it is just within this site), so each use of each citation must be hand-crafted with love. There are some other solutions which I am looking at integrating with this one but have not had time to really explore. One option is to integrate with Mendeley and aggregate bibliography data from a Mendeley collection. Another option is to create bibliographies as BibTeX files and then use those to display the bibliography.

References
↑1 Hand Code the contents of the citation as it is to appear in the bibliography here, between a set of double parentheses.
↑2 Nikolaus P. Himmelmann. 1998. Documentary and Descriptive Linguistics. Linguistics vol. 36:161-195. [PDF] [Accessed 24 Dec. 2010]
Posted in Blogging, Citations, Library, Meta-data, WordPress | Tagged Bibliography, citation, Endnotes, Footnotes, information processing, Mendley, wordpress

No hCite format defined

Posted on February 1, 2011 by Hugh Paterson III

I am looking to re-skin Wikindx. I thought that I would add some CSS classes that would embed the meta-data in such a manner that the citations could be picked up by Zotero quite easily. It seems to be a bit more difficult than I first anticipated, as a Microformat for citations has not yet been fully fleshed out. Obviously one way to go would be to embed everything in a span element as COinS does, but that is not really what I am looking for. (Mostly because I don’t have a way to generate the attributes in the span element automatically.) I have thought of using RDFa. But I still need to do some more research and see what can be gleaned in terms of which controlled vocabularies to use. I am hoping that this Lesson On RDFa will really help me out here. Finally, I do need to know something about OAI so that once the resources are put into Wikindx I can then tell OLAC what language they belong to.
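Since the roadblock is generating the attributes in the span element automatically, here is a minimal sketch in Python of building a COinS span for a journal article. COinS wraps an OpenURL ContextObject in the span's title attribute; the particular rft fields chosen here are my own illustration:

```python
from urllib.parse import urlencode
from html import escape

def coins_span(atitle, jtitle, aulast, date, volume, spage):
    """Build a COinS <span> for a journal article."""
    ctx = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.aulast": aulast,
        "rft.date": date,
        "rft.volume": volume,
        "rft.spage": spage,
    }
    # escape() turns the & separators into &amp; so the attribute stays valid HTML
    return '<span class="Z3988" title="%s"></span>' % escape(urlencode(ctx))

print(coins_span("Documentary and Descriptive Linguistics", "Linguistics",
                 "Himmelmann", "1998", "36", "161"))
```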

Posted in Blogging, CSS, Library, Meta-data, Opensource | Tagged Bibliography, citations, COinS, Meta-Data, OLAC

Open Source Language Codes Meta-data

Posted on July 10, 2010 by Hugh Paterson III

One of the projects I have been involved with published a paper this week in JIPA. It is a first for me: being published. Being the thoughtful person I am, I was considering how this paper will be categorized by librarians. For the most part papers themselves are not catalogued; rather, journals are catalogued. In a sense this is reasonable considering all the additional meta-data librarians would have to create in their meta-data tracking systems. However, in today’s world of computer catalogues it is really a shame that a user can’t go to a library catalogue and ask: what resources are related to German [deu]? As a language and linguistics researcher I would like to quickly reference all the titles in a library or collection which reference a particular language. The use of the ISO 639-3 standard can and does help with this. OLAC also tries to help with this resource location problem by aggregating the tagged contents of participating libraries. But in our case the paper makes reference to over 15 languages via ISO 639-3 codes. So our paper should have at least those 15 codes in its meta-data entry. Furthermore, there is no way for independent researchers to list their resources in the OLAC aggregation of resources. That is, I cannot go to the OLAC website, add my citation, and connect it to a particular language code.

There is one more twist which I noticed today too: one of the ISO codes is already out of date. This could be conceived of as a publication error. But even if the ISO had made its change after our paper was published, this issue would still be persistent.

During the course of the research and publication process of our paper, change request 2009-78 was accepted by the ISO 639-3 Registrar. This is actually a good thing. (I really am pro ISO 639-3.)

Basically, Buhi’non Bikol is now considered a distinct language and has been assigned the code [ubl]. It was formerly considered to be a variety of Albay Bicolano [bhk]. As a result of this change [bhk] has now been retired.

Here is where we use the old code; on page 208 we say:

voiced velar fricative [ɣ]

  • Aklanon [AKL] (Scheerer 1920, Ryder 1940, de la Cruz & Zorc 1968, Payne 1978, Zorc 1995) (Zorc 1995: 344 considers the sound a velar approximant)
  • Buhi’non [BHK] (McFarland 1974)

In reality McFarland did not reference the ISO code in 1974. (ISO 639-3 didn’t exist yet!) So the persistent information is that it was the language Buhi’non. I am not so concerned with errata or getting the publication to be corrected. What I want is for people to be able to find this resource when they are looking for it. (And that includes searches which are looking for a resource based on the languages which that resource references.)

The bottom line is that the ISO does change. And when it does change we can start referencing our new publications and data to the current codes. But there are going to be thousands of libraries out there with out-dated language codes referencing older publications. A librarian’s perspective might say that they need to add both the old and the new codes to the card catalogues. This is probably the best way to go about this. But who will notice that the catalogues need to be updated with the new codes? What this change makes me think is that there needs to be an Open Source vehicle where linguists and language researchers can contribute their knowledge about language resources to a community. Then librarians can pull that meta-data from that community. The community needs to be able to vet the meta-data so that the librarians feel like it is credible meta-data. In this way the quality and relevance of meta-data can always be improved upon.
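As a small sketch of what "adding both the old and the new codes" could look like in a catalogue system, here is a Python illustration. The table below holds only the change discussed in this post; a real implementation would consume the full ISO 639-3 retirements list, and a retired code can map to several successors when a language is split:

```python
# Retired ISO 639-3 codes mapped to their successor code(s).
RETIRED_CODES = {
    "bhk": ["ubl"],  # Albay Bicolano retired; Buhi'non Bikol is now [ubl] (change request 2009-78)
}

def searchable_codes(code):
    """Expand a language code so resources under old and new codes stay findable."""
    return [code] + RETIRED_CODES.get(code, [])

# A catalogue search for [bhk] should also surface resources now tagged [ubl]:
print(searchable_codes("bhk"))  # ['bhk', 'ubl']
```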

Posted in Access, Digital Archival, Language Documentation, Library, Linguistics, Meta-data, Opensource | Tagged ISO 639-3, Library, Linguistics, Open Source

All PDFs are not created Equal

Posted on February 18, 2010 by Hugh Paterson III

Many of us use PDFs every day. They’re great documents to work with and read from because they are easy to use and easy to create.

I think I started to use PDFs for the first time in 2004. That’s when I got my first computer. Since that time, most PDFs I have needed to use have just worked. In the time that I have been using PDFs I have noticed that there are at least three major ways in which PDFs are not created equally:

  1. Validity of the PDF: adherence to the PDF document standard.
  2. Resolution of the contained images.
  3. The presence and accuracy of the PDF’s meta-data.

Validity

Since 2004, there have only been a few PDFs which, after creation and distribution, would not render in any of my PDF readers, or in the readers my friends used (most of these PDFs were created by Microsoft Word or Microsoft Publisher on Windows, and actually one or two were created by Apple’s word processor Pages). Sometimes these errors had to do with a particular image included in the source document. The image may have been malformed, but this was not always the case. Sometimes it was the PDF creator, which was creating non-cross-platform PDFs.

Not all PDFs are created equal. (This is inherently true when one considers the PDF/A and PDF/X standards; the University of Michigan put a small flyer together on how to get something like a PDF/A to print from MS Word on OS X and Windows [Link]. However, let’s side-step those standards for a moment.) To frame this discussion, it is necessary to acknowledge that there is a difference between creating a digital document with a life expectancy of 3 weeks and one with a life expectancy of 150 years. So for some applications, what I am about to say is a moot point. However, looking towards the long term…

If an archival institution wants a document as a PDF, what are the requirements that that PDF needs to have?

What if the source document is not in Unicode? Is the font used in the source document automatically embedded in the PDF upon PDF creation? Consider this from PDFzone.com:

Embedding Fonts in a PDF
Another common area of complaint among frequent PDF users is font incompatibility and problems with font embedding. Here are some solutions and tips for putting your best font forward so to speak.

Keep in mind that when it comes to embedding fonts in a PDF file you have to make certain that you have the correct fonts on the system you’re using to make the conversion. Typically you should embed and subset fonts, although there are always exceptions.

If you just need a simple solution that will handle the heavy font work for you, the WonderSoft Virtual PDF Printer helps you choose and embed your fonts into any PDF. The program supports True Type and Unicode fonts.

The left viewing window shows you all the fonts installed on your system and the right viewing window shows the selected user fonts to embed into a newly created PDF form. A single license is $89.95.

Another common solution is the 3-Heights Optimization PDF Optimization Tool [Link Removed].

One of the best sources of information on all things font is at the Adobe site itself under the Developer Resources section.

3-Heights does have an enterprise-level PDF validator. I am not sure if there is one in the open source world. But it would seem to me that any archival institution should be concerned with not just having PDFs in their archive, but should also be keenly interested in having valid PDFs in their archives. This is especially true when we consider that one of today's major security loopholes is malformed file types, i.e. PDFs that are not really PDFs, or PDFs with something malicious attached or embedded. (Here is a nice blog post about embedding a DLL in a PDF.) I am sure that there is more than one method to this madness, but it only takes one successful attempt to create a security breach. In fact there are several methods reported, some with JavaScript, some without. Here are a few:

  • Embedded PDF executable hack goes live in Zeus malware attacks
  • Hacker finds a way to exploit PDF files, without a vulnerability
  • Escape From PDF
  • Adobe suggests workaround for PDF embedded executable hack
  • CVE-2010-1240 : Adobe PDF Embedded EXE Social Engineering

Apparently, several kinds of media can be embedded in PDFs. These include movies and songs, JavaScript, and forms that upload user input to a web server from within the PDF. And there’s no forgetting the function within the PDF specs to launch executables.

Printing the PDF does not seem to be a fail-proof method to see if the PDF is valid or even usable. See this write-up from The University of Sydney School of Mathematics and Statistics:

Problem
I can read the file OK on screen, but it doesn’t print properly. Some characters are missing (often minus signs) and sometimes incorrect characters appear. What can I do?
Solution
When the Acrobat printing dialog box opens, put a tick in the box alongside “print as image”. This will make the printing a lot slower, but should solve the problem. (The “missing minus signs” problem seemed to occur for certain – by now rather old – HP printers.)
(Most of these problems with pdf files are caused by subtle errors with the fonts the pdf file uses. Unfortunately, there are only a limited number of fonts that supply the characters needed for files that involve a lot of mathematics.)

Printing a PDF is not necessarily a fail-proof way to see if a PDF is usable. Even if the PDF is usable, printing the PDF does not ensure that it is a valid PDF either. When considering printing as a fail-proof method one should also consider that PDFs can contain video, audio, and Flash content. So how is one to print this material? Or, in an archival context, determine that the PDF is truly usable? A valid PDF will render and archive correctly because it conforms to the PDF standard (whatever version of that standard is declared in the PDF). Having a PDF conform to a given PDF standard puts the onus on the creator of the PDF viewer (software) to implement the PDF standard correctly, thus making the PDF usable (as intended by the PDF’s author).

Note: I have not checked the Digital Curation Center for any recommendations on PDFs and ensuring their validity on acceptance to an archive.
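As an aside, even a cheap structural check catches some of the "not really a PDF" cases mentioned above. The sketch below (Python, my own illustration; nowhere near a real validator like the 3-Heights tool) only looks for the header magic number and the %%EOF marker:

```python
import os

def looks_like_pdf(path):
    """Cheap sanity check: PDF header magic and a %%EOF marker near the end.
    Passing this proves nothing about conformance to the PDF standard."""
    with open(path, "rb") as f:
        header = f.read(8)                            # e.g. b'%PDF-1.4'
        f.seek(max(0, os.path.getsize(path) - 1024))  # %%EOF sits near the end of the file
        tail = f.read()
    return header.startswith(b"%PDF-") and b"%%EOF" in tail

print(looks_like_pdf("paper.pdf"))
```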

Resolution of Contained Images

A second way that PDF documents can vary is that the resolutions of the images contained in them can vary considerably. The images inside of a PDF can be in a variety of image formats: .jpg, .tiff, .png, etc. So the types of compression and the lossiness of these compressions can make a difference in the archival “quality” of a PDF. A similar difference is noted to be the difference between a raster PDF and a vector PDF. [1] Besides these two types of differences, there are various PDF printers which print materials to PDF in various formats. This manual discusses setting up Adobe Acrobat Pro’s included PDF printer.

Meta-data

A third way in which PDFs are not created equally is that they do not all contain valid and accurate meta-data in the meta-data containers available in the PDF standard. PDF generators do not all add meta-data to the right places in a PDF file, and those which do sometimes add meta-data to a PDF file do not always add the correct meta-data.

Prepressure.com presents some clear discussion on the embedded properties of meta-data in PDFs. (Prepressure.com has a really helpful section on PDFs and various issues pertaining to PDFs and their use: http://www.prepressure.com/pdf/basics. Their discussion on meta-data can be found at http://www.prepressure.com/pdf/basics/metadata.)

How metadata are stored in PDF files

There are several mechanisms available within PDF files to add metadata:

  • The Info Dictionary has been included in PDF since version 1.0. It contains a set of document info entries, simple pairs of data that consist of a key and a matching value. Some of these are predefined, such as Title, Author, Subject, Keywords, Created (the creation date), Modified (the latest modification date) and Application (the originating application or library). Applications can add their own sets of data to the info dictionary.
  • XMP (Extensible Metadata Platform) is an Adobe technology for embedding metadata into files. It can be used with a wide variety of data files. With Acrobat 5 and PDF 1.4 (2001) this mechanism was also made available for PDF files. XMP is more powerful than the info dictionary, which is why it is used in a number of PDF-based metadata standards.
  • Additional ways of embedding metadata are the PieceInfo Dictionary (used by Illustrator and Photoshop for application specific data when you save a file as a PDF), Object Data (or User Properties) and Measurement Properties.

PDF metadata standards

There are a number of interesting standards for enriching PDF files with metadata. Below is a short summary:

  • There are PDF substandards such as PDF/X and PDF/A that require the use of specific metadata. In a PDF/X-1a file, for example, there has to be a metadata field that describes whether the PDF file has been trapped or not.
  • The GWG ad ticket provides a standardized way to include advertisement metadata into a PDF file.
  • Certified PDF is a proprietary mechanism for embedding metadata about preflighting – whether a PDF file intended to be printed by a commercial printer or newspaper has been properly checked on the presence of all fonts, images with a sufficient resolution,…

The filename is metadata as well

The easiest way to add information about a PDF to the file is by giving it a proper filename. A name like ‘SmartGuide_12_p057-096_v3.pdf’ tells a recipient much more about what the file is about than ‘pages_part2_nextupdate.pdf’ does.

  • Add the name of the publication and possibly the edition to the filename.
  • Add a revision number (e.g. ‘v3′) if there will be multiple updates of a file.
  • If a file contains part of the pages of a publication add at least the initial folio to the filename. That allows people to easily sort files in the right order. Use 2 or 3 digits for the page number (e.g. ‘009′ instead of just ‘9′).
  • Do not use characters that are not supported in other operating systems or that have a special meaning in some applications: * < > [ ] = + ” \ / , . : ; ? % # $ | & •.
  • Do not use a space as the first or last character of the filename.
  • Don’t make the filename too long. Once you go beyond 50 characters or so people may not notice the full information or the filename may get clipped in browser windows or applications.
  • Many prepress workflow systems can automatically insert files into a job based on a specific naming convention. This speeds up the processing of the job and can avoid costly mistakes. Consult with your printer – they may have guidelines for submitting files.
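The naming rules quoted above are mechanical enough to check automatically. Here is a minimal sketch in Python (my own illustration of the quoted guidelines, not a tool from Prepressure):

```python
import re

# Characters the guidelines above warn against (special meaning somewhere).
FORBIDDEN = set('*<>[]=+"\\/,.:;?%#$|&')

def check_filename(name):
    """Return a list of complaints about a PDF filename, per the guidelines above."""
    problems = []
    stem = name[:-4] if name.lower().endswith(".pdf") else name
    bad = set(stem) & FORBIDDEN
    if bad:
        problems.append("contains unsupported characters: " + "".join(sorted(bad)))
    if name != name.strip():
        problems.append("starts or ends with a space")
    if len(name) > 50:
        problems.append("longer than ~50 characters; may get clipped")
    if re.search(r"p\d", stem) and not re.search(r"p\d{2}", stem):
        problems.append("page numbers should use 2 or 3 digits, e.g. 'p009' not 'p9'")
    return problems

print(check_filename("SmartGuide_12_p057-096_v3.pdf"))  # -> []
```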

Even on my favorite operating system, OS X, there are several methods available to users for making PDFs of documents. These methods do not all create the same PDFs. (The difference is in the meta-data contained and in the size of the files.) This is pointed out by Rob Griffiths [2] in Macworld, in an article on privacy and being aware of PDF meta-data which might transmit more personal information than the document creator might desire. What Rob points out is that there are several methods of producing PDFs on OS X, and these various methods include or exclude various meta-data details. Just as privacy concerns might motivate the removal of embedded meta-data (or perhaps the creation of a PDF without meta-data), the goal of archival quality should drive the inclusion of meta-data in PDF files hosted by archives. There are two obvious ways to increase the quality of a PDF in an archive:

  1. The individual can enrich the PDF with meta-data prior to submission (risking that the institution will strip the meta-data embedded and input their own meta-data values)
  2. The archive can systemically enrich the meta-data based on the other meta-data collected on the file while it is in their “custody”.

As individuals we can take responsibility for the first point. There are several open source tools for editing the embedded meta-data; one of these is pdfinfo. (Another command line tool is ExifTool (link to Wikipedia). ExifTool is more versatile, working with more file types than just PDF, but again this tool does not have a GUI.) I wish I could find a place to download this command line tool on its own, but it only seems to be in Linux software repositories. However, there are several other command line packages which incorporate this utility. One of these packages is xpdf. Xpdf is available under the GPL for personal use from foolabs. The code has to be compiled from source, but there are links to several other websites with compiled versions for various OSes. There is an OS X package installer available from phg-online.de. For those of us who are strong believers in GUIs and loathe the TUI (Text User Interface, or command line) there is a freely available GUI for pdfinfo from sybrex.com.
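For scripted use, pdfinfo is easy to wrap. A minimal sketch in Python, assuming pdfinfo is on the PATH (e.g. installed as part of an xpdf package):

```python
import subprocess

def pdf_info(path):
    """Run pdfinfo and parse its 'Key: value' output into a dict."""
    out = subprocess.run(["pdfinfo", path], capture_output=True,
                         text=True, check=True).stdout
    info = {}
    for line in out.splitlines():
        key, _, value = line.partition(":")
        if value:
            info[key.strip()] = value.strip()
    return info

meta = pdf_info("paper.pdf")
print(meta.get("Title"), meta.get("Author"), meta.get("Producer"))
```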

Because I use PDFs extensively in matters of linguistic research, I thought that I would take a look at several PDFs from a variety of sources. These include:

  1. JSTOR: Steele (1976) [3]. JSTOR is a well known archive in academic circles (especially the humanities).
  2. Project Muse: (Language) Ladefoged (2007) [4]. Project Muse is another well known repository for the humanities. Language is a well respected journal in the linguistic sciences, published by the Linguistic Society of America.
  3. Cambridge Journals: (Journal of the International Phonetic Association) Olson, Mielke, Sanicas-Daguman, Pebley and Paterson (2010) [5]. Cambridge University Press, of which Cambridge Journals is a part, is a major publisher of linguistic content in the English academic community.
  4. SIL Academic Publishing: Gardner and Merrifield (1990) [6]. This PDF is found through the SIL Bibliography, but was prepared by the Academic Publishing department of SIL. (It is important to note that this work was made available through SIL’s Global Publishing Service (formerly Academic Publishing), not through the Language and Culture Archives. This is evidenced by the acpub in the URL for accessing the actual PDF: www.sil.org/acpub/repository/24341.pdf. As a publishing service, this particular business unit of SIL is more apt to be aware of and use higher PDF standards like PDF/A in their workflows.)
  5. SIL – Papua New Guinea: Barker and Lee (n.d.) [7], made available online in 2009 by SIL – Papua New Guinea.
  6. SIL Mexico Branch: Benito Apolinar Antonio, et al. MWP#9a [8] and MWP#9b [9]. It is interesting to note that the production tool used to create the PDFs for the Mexico Branch Work Papers was XLingPaper. [10] [11] XLingPaper is a plugin for XMLMind, an XML editor, used for creating multiple products from a single XML data source. (In this case the data source is the linguistics paper.) However, advanced authoring tools like XLingPaper, and LaTeX and its flavors like XeTeX, should be able to handle the assignment of keywords and meta-data at the creation of the PDF.
  7. Example of a PDF made from Microsoft Word: Snider (2011) [12].
  8. Example of a PDF made from Apple Pages: Paterson and Olson (2009) [13].

The goal of the comparison is to look at a variety of PDF resources from a variety of locations and content handlers. I have included two linguistic journals, two repositories for journals, and several items from various SIL outlets. Additionally, I have included two different PDFs which were authored with popular word processing applications. To view the PDFs and their meta-data I used Preview, a PDF and media viewer which ships with OS X and is created by Apple. Naturally, the scope of the available meta-data to be viewed is limited to what Preview is programmed to display. Adobe Acrobat Pro will display more meta-data fields in its meta-data editing interface.

  • JSTOR:
    Using Preview on OS X to look at the embedded meta-data in a PDF from JSTOR.

  • Project Muse:
    Using Preview on OS X to look at the embedded meta-data of a PDF from Project Muse and the journal Language

  • Cambridge Journals:
    Using Preview on OS X to look at the embedded meta-data of a PDF from Cambridge Journals and the Journal of the International Phonetic Association

  • SIL Academic Publishing (not the archive):
    Using Preview on OS X to look at the embedded meta-data of a PDF as it was prepared by Academic Publishing

    A close-up view of the Keywords meta-data as it was prepared by Academic Publishing. Among the PDFs surveyed, Academic Publishing was the only producer to use Keywords. They were also the only one to use or embed the ISO 639-3 code of the subject language of the item.
  • SIL – Papua New Guinea:
    Using Preview on OS X to look at the embedded meta-data of a PDF prepared by SIL - Papua New Guinea

  • SIL Mexico Branch:
    Work Papers #9a
    Using OS X to look at the embedded meta-data of a PDF prepared by SIL - Mexico Branch

    Work Papers #9b

    Using Preview on OS X to look at the embedded meta-data of a PDF prepared by SIL - Mexico Branch

  • MS Word Example:
    Using Preview on OS X to look at the embedded meta-data of a PDF prepared by an individual using MS Word

    Notice in the title that the application used to create the PDF inserts “Microsoft Word – ” before the document title.
  • Apple Pages Example:
    Pages Document Inspector showing where one can edit the meta-data which will be passed to the PDF when created using the Export option.

As we can see from the images presented here, there is not widespread adoption of a systematic process on the part of:

  • publishers, or
  • developers of writing utilities, like MS Word or XLingPaper, to encourage the end user to produce enriched PDFs.

Additionally, there is not a systemic process used by content providers to enrich content produced by publishers.

However, enriched content (PDFs) is used by a variety of PDF management applications and citation management software. That is, consumers do benefit from the enriched state of PDFs, and consumers are looking for these features. (The discussion on Yep 2’s forums highlights this point. Yep 2 is a consumer/desktop media & PDF management tool. There are several other tools out there like Papers2, Mendeley, Zotero, and even Endnote.)

If I were to extend this research I would look at PDFs from more content providers. I would look for a PDF from an Open Access repository like the Rutgers Optimality Archive, a dissertation from ProQuest, some content from a reputable archive like PARADISEC, and something from a DSpace implementation. (Xpdf can be used in conjunction with DSpace; in fact it is even mentioned in the manual.)

References
↑1 Yishai. 1 July 2009. All PDF’s are not created equal. Part III (out of III). Digilabs Technologies Blog. http://digilabsblog.wordpress.com/2009/07/01/all-pdf’s-are-not-created-equal-part-iii-out-of-iii/. [Link] [Accessed: 23 January 2012]
↑2 Rob Griffiths. Keep some PDF info private. Macworld.com. Mar 1, 2007 2:00 am. <Accessed 14 March 2011>. [Link]
↑3 Susan M. Steele. 1976. A Law of Order: Word Order Change in Classical Aztec. International Journal of American Linguistics, vol 42 (1): 31-45. [Link]
↑4 Peter Ladefoged. 2007. Articulatory Features for Describing Lexical Distinctions. Language 83.1: 161-80.
↑5 Kenneth S. Olson, Jeff Mielke, Josephine Sanicas-Daguman, Carol Jean Pebley & Hugh J. Paterson III. 2010. The phonetic status of the (inter)dental approximant. Journal of the International Phonetic Association 40.02: 199-215. [Link]
↑6 Richard Gardner and William R. Merrifield. 1990. Quiotepec Chinantec tone. In William R. Merrifield and Calvin R. Rensch (eds.), Syllables, tone, and verb paradigms: Studies in Chinantec languages 4, 91-105. Summer Institute of Linguistics and the University of Texas at Arlington Publications in Linguistics, 95. Dallas: Summer Institute of Linguistics and the University of Texas at Arlington. [PDF]
↑7 Fay Barker and Janet Lee. Available: 2009; Created: n.d.. A tentative phonemic statement of Waskia. [Manuscript] 40 p. [Link].
↑8 Benito Apolinar Antonio, et al. 2010. Vocabulario básico en me’phaa. SIL-Mexico Electronic Working Papers #9a. [PDF].
↑9 Benito Apolinar Antonio, et al. 2010. Vocabulario básico en me’phaa. SIL-Mexico Electronic Working Papers #9b. SIL International. [PDF].
↑10 H. Andrew Black. 2009. Writing linguistic papers in the third wave. SIL Forum for Language Fieldwork 2009-004:11. http://www.sil.org/silepubs/abstract.asp?id=52286. [PDF]
↑11 H. Andrew Black, and Gary F. Simons. 2009. Third wave writing and publishing. SIL Forum for Language Fieldwork 2009-005: 15 http://www.sil.org/silepubs/abstract.asp?id=52287. [PDF]
↑12 Keith Snider. 2011. On Discovering Contrastive Tone Melodies. Paper presented at the Berkeley Tone Workshop, 18-20 February 2011, University of California, Berkeley.
↑13 Hugh Paterson III and Kenneth Olson. 2009. An unlikely retention. Paper presented at the 11th International Conference on Austronesian Linguistics, 22–26 June 2009, Aussois, France.
Posted in Access, Digital Archival, Home Business, Language Documentation, Meta-data, OS X | Tagged archival, Archive, curation, Digital Archival, life cycle, PDF, Valid
