A User Experience look at Linguistic Archiving
In a recent paper Jeremy Nordmoe, a friend and colleague, states that:
Because most linguists archive documents infrequently, they will never be experts at doing so, nor will they be experts in the intricacies of metadata schemas.
My initial reply is:
You are d@#n right! and it is because archives are not sexy enough!
A document’s DOI (http://www.doi.org/ or on Wikipedia under Digital Object Identifier) is an important part of the citation of a document. Many style sheets allow for just the DOI of a paper as the citation. Because DOIs are unique they can act as URIs which are resolvable and look like URLs. However, a DOI is different from a URL: a URL only records where a digital object happens to be located, and that location can change. It might be well argued that a DOI should be tracked in the metadata schemes of archives which collect language and linguistic data.
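As a minimal sketch of why a DOI is worth storing separately from a location, the snippet below turns a stored DOI into a resolvable link through the public doi.org resolver. The function name and the sample DOI are only illustrative; an archive's metadata record would hold the bare DOI and build the link on demand.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_to_url(doi):
    """Build a resolvable link from a bare DOI such as '10.1000/182'."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix used in citations.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1000/182"))
```

The record keeps the identifier; the URL of the file itself can change (as the next example shows) without the citation breaking.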
I was looking at the Wikipedia article for Language Documentation. The only reference cited was a thesis by Debbie Chang. I happen to know Debbie, so I thought I would take a look at her thesis and see what she said. I clicked the link and was delivered to a 404 error page on GIAL’s website. GIAL had recently renovated their website. By digging through the GIAL website I was able to locate the thesis and fix the URL on Wikipedia. The new URL is: http://www.gial.edu/images/theses/Chang_Debbie-thesis.pdf
But then I looked at the URL and asked: Why are PDFs in the images folder? What is the long-term infrastructure for this school? When PDFs (theses) are put into the images folder rather than into a digital repository, it seems that something is not quite right with the long-term planning for the school. Ironically, this is not too far from the main thrust of Debbie’s thesis.
It would seem that the long-term solution for this kind of problem would be for a small school like GIAL to either (A) have its library develop an infrastructure for permanently housing these kinds of materials, or (B) contract with another organization or archive which could take care of these sorts of issues for them and provide handles or stable URLs, with GIAL then linking to the permanent location of these items from its own website. It is interesting to note that on the same campus as GIAL is SIL International’s Language and Culture Archive, yet GIAL has not taken advantage of this opportunity.
Is what you say what you want really what you want?
I am involved in an operation which is tasked with digitizing content created by SIL staff in the Americas. All 80 or so years of history. The end goal is to make the items accessible and usable as widely as possible (there are a lot of factors which dictate how wide, wide truly is). Today I came across an item which was created at the end of 2008. It was "born digital" that is, it was created on a computer. As such it should not need to be scanned if the digital production file can be located. Unfortunately, this is not the only item in its class. There are quite a few items in the line up to be scanned which have been born digital in the last few years. It would help us to understand a little bit about the item in question to fully realize this scenario.
Here was the process for creating the item in Dec. 2008:
- Item was created in a .txt / xml environment.
- The text was flowed through a page layout process and put into a PDF.
- The PDF was taken to a printer and printed.
- A copy of the printout was presented to the Language and Culture Archive
So there should be a .txt/xml type file (valid archival format) for this item, and there should be a PDF for this item (also possibly an archival format). Neither of these files has been submitted to the archive at SIL International nor does the SIL Area archiving staff have a definitive recourse to acquire the file.
To understand some of the impact of this statement it is important to understand some of the corporate history and the corporate structure (with a hint of corporate culture).
SIL's history is as one organization, which started in Mexico. Through time the founders also started what might be best classified as sister organizations with the same name in various countries. Again with the passage of time an organization was conceived which needed to support and in some ways "unify" the various sister organizations. This cover organization is known as SIL International. These management structures, or their vestiges, still exist today. In recent times, though, expatriate staff have been returning from working within host countries and overall staff counts have been in decline (particularly in the Americas). So as branches (these former sister companies) have folded, they have folded into a larger management structure called an Area. These branches retain a rather autonomous position (in management practice and in goal setting and policy), while being connected and dedicated at some level to the larger overarching stated goals of SIL International. Yet an individual might be under the administration of any of these administrative structures. (This is not a universally understood concept. That is, the alternative perspectives, "Is an SIL staff person there for the needs of the company, or is the company there to serve the individual?", are still a disputed issue in the minds of many people serving with SIL.)
This history has left the archiving practice in an interesting managerial arrangement. Former branches which have folded into the area are often called regions and are administered by a regional director. This might be illustrated by the following diagram.
An alternative organization method would be to organize around the content of the task. That is illustrated in the lower right of the above diagram by grouping all of the archivists together administratively and marketing their operations as a service. However, discussion of that sort of organizational change is beyond the scope of this post.
As things stand currently, though, the operational goal of this project is to make content accessible and usable to end users. More use cases can be solved if archivable formats are used and the objects collected are actually those same digitally created objects. However, managerial success on the project is measured by how many scans are made of products in the Americas Area's reach, rather than by the quality of the items that the archive is able to put into the hands of end users. So for these items which were born digital, because we do not have a recourse to pursue the file, we will scan the item. We will also then "clean up" the item and make it into .tiff files and a PDF (a sum of about 5 hours of work for every 100 pages). Now, is the original digital item out of reach of our pursuit? Well, there is one more structure which needs to be understood before this can be fully answered.
In this diagram the area director has the mandate to secure all property belonging to the SIL organizational/business unit, including intellectual property. This part of his responsibility has been delegated to one of his subordinates, the Support Services Director. The Support Services Director manages the staff providing services to the Language Program People. But in the Americas Area, Language Program personnel are trained not to respond to persons who are not in their direct chain of supervisors. This means that the area Archive Coordinator has to coordinate with the Language Programs Director to get a request to the appropriate field person. It also means that the person working in the field is not responsible for archiving their work (because this part of the mandate is viewed as being fulfilled by the archive coordinator). This leads to some interesting problems in terms of managing intellectual property. Intellectual property accountability and human resource accountability are not as highly ranked as financial accountability. These can be inherently difficult aspects of any business to manage, let alone a not-for-profit organization. It would be interesting if IP and HR resources could be evaluated like finances are by the ECFA. It would seem that in the SIL family of organizations there is a corporate value/culture of not valuing intellectual property. In terms of market economy, intellectual property is generally not viewed as being monetizable. Therefore, the products containing the IP are also not worth more than the moment's task. This is possibly in part because the organization is a relationally motivated organization and not a data-driven organization. There are several ways that this disjunct can be viewed. One of them is that there should be a data plan as part of the project plan before funding for the plan is provided. (This data plan would include archiving, backup, and distribution.)
Additionally, a separate but related plan should be implemented to cover IP issues, copyright issues, and the licensing and use of data and products. Pushing this to the project planning level puts the burden on the project doers to meet the requirements for funding. This model is often used in European Union financed research projects. In 2011 the National Science Foundation in the U.S. also required a data management plan to be submitted with grant applications. It is interesting that SIL International's funders do not require this to be part of the project planning.
However, having a data management plan does not cover the above use case completely. The project did submit a physical object to the archive at one point. The problem here is the continued access to an ongoing project by services being performed in one part of the company for individuals in another part of the company. This is a management and service integration issue. Because there is a perception that management is too busy, or that this is not a high enough priority for them to act on in a timely manner, it costs the archiving department 5-6 man hours when all that might be needed is 10-20 minutes of email time. But being efficient, or providing a higher quality product which is more usable and has a smaller digital footprint, does not come into the matrix for evaluating results. Seems to me to be a process design FAIL.
Working in an archive, I deal with a lot of metadata. Some of this metadata is from controlled vocabularies. Sometimes they show up in lists. Sometimes these controlled vocabularies can be very large, as with the names of languages, where the number is limited but is still just over 7,000. I like to keep an eye out for how websites optimize the options for users. Facebook has a pretty cool feature for narrowing down the list of possible family relationships someone has to you, i.e. a sibling could be a brother/sister, step-brother/step-sister, or a half-brother/half-sister. But if the sibling is male, it can only be a brother, step-brother, or a half-brother.
Facebook narrows the logical selection down based on attributes of the person mentioned in the relationship.
That is, if I select Becky, my wife, as a person to be in a relationship with me, then Facebook determines, based on her gender attribute, that she can only be referenced by the female relationships.
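The narrowing described above can be sketched in a few lines. This is not how Facebook implements it; the vocabulary, the attribute values, and the function name are all made up for illustration. The same idea could narrow a long language list by, say, the country already entered for a record.

```python
# Hypothetical controlled vocabulary: each relationship term lists the
# gender values it can apply to.
RELATIONSHIP_TERMS = {
    "brother": {"male"},
    "step-brother": {"male"},
    "half-brother": {"male"},
    "sister": {"female"},
    "step-sister": {"female"},
    "half-sister": {"female"},
}


def narrow_terms(vocabulary, gender=None):
    """Return the terms still possible given what is known about the person.

    When the gender is unknown (None), every term stays a candidate.
    """
    if gender is None:
        return sorted(vocabulary)
    return sorted(term for term, genders in vocabulary.items() if gender in genders)


print(narrow_terms(RELATIONSHIP_TERMS, "male"))
# ['brother', 'half-brother', 'step-brother']
```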
The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachment of the digital item to the proper bitstream. Totally Awesome.
However, one of the challenges is that just because the metadata is curated, collected and properly filed, it does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft-Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time), but later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004; see James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards.
So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata in the files before upload so that when the files arrive in DSpace that they will have the correct metadata.
- The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so as they are both brain children of the geeks at Adobe.) However, what is needed in this particular set up is a two way street with a check in between. We would need to overwrite what was there with the data we want there.
- However, as of 2009, there were no tools in AIR which could manipulate EXIF data (for photos).
- But it does look like the situation is more hopeful for working with audio metadata.
Three Lingering Thoughts
- Even if the Resource and Metadata Packager has the abilities to embed the metadata in the files themselves, it does not mean that the submitters would know about how to use them or why to use them. This is not, however, a valid reason to not include functionality in a development project. All marketing aside, an archive does have a responsibility to consumers of the digital content, that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers do appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process, pays dividends as consumers use the files with programs like Picasa, iPhoto, PhotoShop, iTunes, Mendeley, Papers, etc.
- Another thought that comes to mind is that when one is dealing with large files (over 1 GB) there is a reason for making a “preview” version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see if it is worth downloading the .wav file? It seems that a metadata packager could also create a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
- This problem has been addressed in the open source library science world. In fact a nice piece of software does live out there. It is called the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs but it is a solution for some needs of this type.
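As a rough sketch of the preview idea from the second thought above, a packager could build an ImageMagick command for each large image it receives. The function name, the `_preview` naming, and the 1024-pixel limit are assumptions for illustration; `convert` is ImageMagick's classic program name (newer releases also ship `magick`). An audio preview would follow the same pattern with an audio encoder.

```python
from pathlib import Path


def image_preview_command(source, max_side=1024):
    """Build (but do not run) an ImageMagick command for a small JPEG preview."""
    source = Path(source)
    # Hypothetical naming convention: page_001.tif -> page_001_preview.jpg
    preview = source.with_name(f"{source.stem}_preview.jpg")
    # "WxH>" fits the image inside the box, keeps its proportions, and only
    # shrinks images that are larger than the box.
    geometry = f"{max_side}x{max_side}>"
    return ["convert", str(source), "-resize", geometry, str(preview)]


command = image_preview_command("scans/page_001.tif")
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where ImageMagick is installed; passing a list rather than a shell string means the `>` in the geometry needs no quoting.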
Many of us use PDFs every day. They’re great documents to work with and read from because they are easy to use and easy to create.
I think I started to use PDFs for the first time in 2004. That’s when I got my first computer. Since that time, most PDFs I have needed to use have just worked. In the time that I have been using PDFs I have noticed that there are at least three major ways in which PDFs are not created equally:
- Validity of the PDF: Adherence to the PDF document standard.
- Resolution of contained images
- The presence and accuracy of the PDF’s meta-data.
Since 2004, there have only been a few PDFs which after creation and distribution would not render in any of my PDF readers, or in the readers my friends used (most of these PDFs were created by Microsoft Word or Microsoft Publisher on Windows, and actually one or two by Apple’s word processor Pages). Sometimes these errors had to do with a particular image included in the source document. The image may have been malformed, but this was not always the case. Sometimes it was the PDF creator, which was creating non-cross-platform PDFs.
Not all PDFs are created equal. (This is inherently true when one considers the PDF/A and PDF/X standards; however, let’s side-step those standards for a moment. The University of Michigan put a small flyer together on how to get something like a PDF/A to print from MS Word on OS X and Windows. [Link]) To frame this discussion, it is necessary to acknowledge that there is a difference between creating a digital document with a life expectancy of 3 weeks and one with a life expectancy of 150 years. So for some applications, what I am about to say is a moot point. However, looking towards the long term…
If an archival institution wants a document as a PDF, what are the requirements that that PDF needs to have?
What if the source document is not in Unicode? Is the font used in the source document automatically embedded in the PDF upon PDF creation? Consider this from PDFzone.com:
Embedding Fonts in a PDF
Another common area of complaint among frequent PDF users is font incompatibility and problems with font embedding. Here are some solutions and tips for putting your best font forward so to speak.
Keep in mind that when it comes to embedding fonts in a PDF file you have to make certain that you have the correct fonts on the system you’re using to make the conversion. Typically you should embed and subset fonts, although there are always exceptions.
If you just need a simple solution that will handle the heavy font work for you, the WonderSoft Virtual PDF Printer helps you choose and embed your fonts into any PDF. The program supports True Type and Unicode fonts.
The left viewing window shows you all the fonts installed on your system and the right viewing window shows the selected user fonts to embed into a newly created PDF form. A single license is $89.95.
Another common solution is the 3-Heights Optimization PDF Optimization Tool [Link Removed].
One of the best sources of information on all things font is at the Adobe site itself under the Developer Resources section.
- Embedded PDF executable hack goes live in Zeus malware attacks
- Hacker finds a way to exploit PDF files, without a vulnerability
- Escape From PDF
- Adobe suggests workaround for PDF embedded executable hack
- CVE-2010-1240 : Adobe PDF Embedded EXE Social Engineering
Printing the PDF does not seem to be a fail proof method to see if the PDF is valid or even usable. See this write up from The University of Sydney School of Mathematics and Statistics:
I can read the file OK on screen, but it doesn’t print properly. Some characters are missing (often minus signs) and sometimes incorrect characters appear. What can I do?
When the Acrobat printing dialog box opens, put a tick in the box alongside “print as image”. This will make the printing a lot slower, but should solve the problem. (The “missing minus signs” problem seemed to occur for certain – by now rather old – HP printers.)
(Most of these problems with pdf files are caused by subtle errors with the fonts the pdf file uses. Unfortunately, there are only a limited number of fonts that supply the characters needed for files that involve a lot of mathematics.)
Printing a PDF is not necessarily a fail proof way to see if a PDF is usable. Even if the PDF is usable, printing the PDF does not ensure that it is a valid PDF either. When considering printing as a fail proof method one should also consider that PDFs can contain video, audio, and Flash content. So how is one to print this material? Or, in an archival context, determine that the PDF is truly usable? A valid PDF will render and archive correctly because it conforms to the PDF standard (whatever version of that standard is declared in the PDF). Having a PDF conform to a given PDF standard puts the onus on the creator of the PDF viewer (software) to implement the PDF standard correctly, thus making the PDF usable (as intended by the PDF’s author).
Note: I have not checked the Digital Curation Center for any recommendations on PDFs and ensuring their validity on acceptance to an archive.
Resolution of Contained Images
A second way that PDF documents can vary is that the resolutions of the images contained in them can vary considerably. The images inside of a PDF can be in a variety of image formats: .jpg, .tiff, .png, etc. So the types of compression and the lossiness of these compressions can make a difference in the archival “quality” of a PDF. A similar difference is the difference between a raster PDF and a vector PDF. Besides these two types of differences there are various PDF printers, which print materials to PDF in various formats. This manual discusses setting Adobe Acrobat Pro’s included PDF printer.
A third way in which PDFs are not created equally is that they do not all contain valid and accurate meta-data in the meta-data containers available in the PDF standard. PDF generators do not all add meta-data to the right places in a PDF file, and those which do add meta-data do not always add the correct meta-data.
Prepressure.com presents some clear discussion on the embedded properties of meta-data in PDFs. (Prepressure.com has a really helpful section on PDFs and various issues pertaining to PDFs and their use: http://www.prepressure.com/pdf/basics.)
Their discussion on meta-data can be found at http://www.prepressure.com/pdf/basics/metadata.
How metadata are stored in PDF files
There are several mechanisms available within PDF files to add metadata:
- The Info Dictionary has been included in PDF since version 1.0. It contains a set of document info entries, simple pairs of data that consist of a key and a matching value. Some of these are predefined, such as Title, Author, Subject, Keywords, Created (the creation date), Modified (the latest modification date) and Application (the originating application or library). Applications can add their own sets of data to the info dictionary.
- XMP (Extensible Metadata Platform) is an Adobe technology for embedding metadata into files. It can be used with a wide variety of data files. With Acrobat 5 and PDF 1.4 (2001) this mechanism was also made available for PDF files. XMP is more powerful than the info dictionary, which is why it is used in a number of PDF-based metadata standards.
- Additional ways of embedding metadata are the PieceInfo Dictionary (used by Illustrator and Photoshop for application specific data when you save a file as a PDF), Object Data (or User Properties) and Measurement Properties.
PDF metadata standards
There are a number of interesting standards for enriching PDF files with metadata. Below is a short summary:
- There are PDF substandards such as PDF/X and PDF/A that require the use of specific metadata. In a PDF/X-1a file, for example, there has to be a metadata field that describes whether the PDF file has been trapped or not.
- The GWG ad ticket provides a standardized way to include advertisement metadata into a PDF file.
- Certified PDF is a proprietary mechanism for embedding metadata about preflighting – whether a PDF file intended to be printed by a commercial printer or newspaper has been properly checked on the presence of all fonts, images with a sufficient resolution,…
The filename is metadata as well
The easiest way to add information about a PDF to the file is by giving it a proper filename. A name like ‘SmartGuide_12_p057-096_v3.pdf’ tells a recipient much more about what the file is about than ‘pages_part2_nextupdate.pdf’ does.
- Add the name of the publication and possibly the edition to the filename.
- Add a revision number (e.g. ‘v3’) if there will be multiple updates of a file.
- If a file contains part of the pages of a publication add at least the initial folio to the filename. That allows people to easily sort files in the right order. Use 2 or 3 digits for the page number (e.g. ‘009’ instead of just ‘9’).
- Do not use characters that are not supported in other operating systems or that have a special meaning in some applications: * < > [ ] = + ” \ / , . : ; ? % # $ | & •.
- Do not use a space as the first or last character of the filename.
- Don’t make the filename too long. Once you go beyond 50 characters or so people may not notice the full information or the filename may get clipped in browser windows or applications.
- Many prepress workflow systems can automatically insert files into a job based on a specific naming convention. This speeds up the processing of the job and can avoid costly mistakes. Consult with your printer – they may have guidelines for submitting files.
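The naming guidelines quoted above lend themselves to a mechanical check at submission time. The sketch below is one possible reading of them, not anything Prepressure.com publishes: the function name is hypothetical, the 50-character limit is taken from "50 characters or so", and the character check is applied to the name without its extension, since the list of discouraged characters includes the full stop.

```python
# Characters the quoted guidelines advise against. The straight double quote
# is an assumption: the source shows a curly quote at that position.
DISCOURAGED = set('*<>[]=+"\\/,.:;?%#$|&•')
MAX_LENGTH = 50  # "50 characters or so" in the guidelines


def filename_problems(filename):
    """Return a list of guideline violations for a file name (empty if none)."""
    stem, dot, _extension = filename.rpartition(".")
    if not dot:
        stem = filename  # the name has no extension at all
    problems = []
    if filename != filename.strip(" "):
        problems.append("starts or ends with a space")
    bad = sorted(set(stem) & DISCOURAGED)
    if bad:
        problems.append("discouraged characters: " + " ".join(bad))
    if len(filename) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    return problems


print(filename_problems("SmartGuide_12_p057-096_v3.pdf"))  # []
print(filename_problems("notes: draft?.pdf"))
```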
Even on my favorite operating system, OS X, there are several methods available to users for making PDFs of documents. These methods do not all create the same PDFs. (The difference is in the meta-data contained and in the size of the files.) This is pointed out by Rob Griffiths in Macworld in an article on privacy and on being aware of PDF meta-data which might transmit more personal information than the document creator might desire. What Rob points out is that these various methods of producing PDFs on OS X include or exclude various meta-data details. Just as privacy concerns might motivate the removal of embedded meta-data (or perhaps the creation of PDFs without meta-data), concern for archive quality should drive the inclusion of accurate meta-data in PDF files hosted by archives. There are two obvious ways to increase the quality of a PDF in an archive:
- The individual can enrich the PDF with meta-data prior to submission (risking that the institution will strip the meta-data embedded and input their own meta-data values)
- The archive can systemically enrich the meta-data based on the other meta-data collected on the file while it is in their “custody”.
As individuals we can take responsibility for the first point. There are several open source tools for working with the embedded meta-data; one of these is pdfinfo, which reports the meta-data stored in a PDF. (Another command line tool is ExifTool (Link to Wikipedia). ExifTool is more versatile, working with more file types than just PDF, but again this tool does not have a GUI.) I wish I could find a place to download pdfinfo on its own, but it only seems to be in Linux software repositories. However, there are several other command line packages which incorporate this utility. One of these packages is xpdf. Xpdf is available under GPL for personal use from foolabs. The code has to be compiled from source but there are links to several other websites with compiled versions for various OSes. There is an OS package installer available from phg-online.de. For those of us who are strong believers in GUIs and loathe the TUI (Text User Interface, or command line) there is a freely available GUI for pdfinfo from sybrex.com.
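For anyone willing to brave the command line, a small script can turn a pdfinfo report into something checkable before submission. This is only a sketch: the sample report is illustrative text in the "Key: value" layout that pdfinfo prints, not output from a real file, and the function names and the list of wanted fields are my own assumptions.

```python
# Illustrative report text, not taken from a real PDF.
SAMPLE_REPORT = """\
Title:          Microsoft Word - thesis.doc
Producer:       Mac OS X Quartz PDFContext
Pages:          12
"""


def parse_pdfinfo(report):
    """Turn 'Key: value' lines into a dictionary."""
    fields = {}
    for line in report.splitlines():
        # Split at the first colon only; dates and titles may contain more.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields, wanted=("Title", "Author", "Keywords")):
    """List the wanted fields that are absent or empty."""
    return [name for name in wanted if not fields.get(name)]


info = parse_pdfinfo(SAMPLE_REPORT)
print(missing_fields(info))  # ['Author', 'Keywords']
```

With xpdf installed, a real report could be captured with `subprocess.run(["pdfinfo", path], capture_output=True, text=True).stdout` and passed to `parse_pdfinfo`.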
Because I use PDFs extensively in matters of linguistic research I thought that I would take a look at several PDFs from a variety of sources. This would include:
- JSTOR: Steele (1976). JSTOR is a well-known archive in academic circles (especially the humanities).
- Project Muse: (Language) Ladefoged (2007). Project Muse is another well-known repository for the humanities. Language is a well-respected journal in the linguistic sciences, published by the Linguistic Society of America.
- Cambridge Journals: (Journal of the International Phonetic Association) Olson, Mielke, Olson, Sanicas-Daguman, Pebley and Paterson (2010) Cambridge Press, of which Cambridge Journals is a part, is a major publisher of linguistic content in the English academic community.
- SIL Academic Publishing: Gardner and Merrifield (1990). This PDF is found through the SIL Bibliography, but prepared by Academic Publishing (department) of SIL. It is important to note that this work was made available through SIL’s Global Publishing Service (formerly Academic Publishing), not through the Language and Culture Archives. This is evidenced by the acpub used in the URL for accessing the actual PDF: www.sil.org/acpub/repository/24341.pdf. As a publishing service, this particular business unit of SIL is more apt to be aware of and use higher PDF standards like PDF/A in their workflows.
- SIL – Papua New Guinea: Barker and Lee (n.d.) but made available online in 2009 by SIL – Papua New Guinea.
- SIL Mexico Branch: Benito Apolinar Antonio, et al. MWP#9a and Benito Apolinar Antonio, et al. MWP#9b. It is interesting to note that the production tool used to create the PDFs for the Mexico Branch Work Papers was XLingPaper. XLingPaper is a plugin for XMLMind, an XML editor. It is used for creating multiple products from a single XML data source. (In this case the data source is the linguistics paper.) However, advanced authoring tools like XLingPaper, LaTeX and its flavors like XeTeX should be able to handle assignment of keywords and meta-data on the creation of the PDF.
- Example of a PDF made from Microsoft Word: Snider (2011)
- Example of a PDF made from Apple Pages: Paterson and Olson (2009)
The goal of the comparison is to look at a variety of PDF resources from a variety of locations and content handlers. I have included two linguistic journals, two repositories for journals, and several items from various SIL outlets. Additionally, I have included two different PDFs which were authored with popular word processing applications. To view the PDFs and their meta-data I used Preview, a PDF and media viewer created by Apple which ships with OS X. Naturally, the scope of the available meta-data to be viewed is limited to what Preview is programmed to display. Adobe Acrobat Pro will display more meta-data fields in its meta-data editing interface.
- Project Muse:
- Cambridge Journals:
- SIL Academic Publishing (not the archive):
Among the PDFs surveyed Academic Publishing was the only producer to use Keywords. They were also the only one to use or embed the ISO 639-3 code of the subject language of the item.
- SIL – Papua New Guinea:
- SIL Mexico Branch:
Work Papers #9a
Work Papers #9b
- MS Word Example:
Notice that the application used to create the PDF inserts “Microsoft Word – ” before the document title.
- Apple Pages Example:
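The “Microsoft Word – ” prefix noted in the MS Word example above is the kind of thing an archive could clean up automatically before embedding a title. A minimal sketch, assuming the prefix appears with either a plain hyphen or an en dash; the function name is hypothetical.

```python
import re

# Matches "Microsoft Word" followed by a hyphen or an en dash at the very
# start of a title, with optional spaces around the dash.
_WORD_PREFIX = re.compile(r"^Microsoft Word\s*[-\u2013]\s*")


def clean_title(title):
    """Drop the application-name prefix that Word adds to a PDF title."""
    return _WORD_PREFIX.sub("", title).strip()


print(clean_title("Microsoft Word - thesis.doc"))  # thesis.doc
```

A title that does not start with the prefix comes back unchanged, so the function is safe to run over a whole collection.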
As we can see from the images presented here there is not a widespread adoption of a systematic process on the part of:
- or on the part of developers of writing utilities, like MS Word, or XLingPaper, to encourage the end user to produce enriched PDFs.
- Additionally, there is not a systemic process used by content providers to enrich content produced by publishers.
However, enriched content (PDFs) is used by a variety of PDF management applications and citation management software. That is, consumers do benefit from the enriched state of PDFs and consumers are looking for these features. (The discussion on Yep 2’s forums highlights this point. Yep 2 is a consumer / desktop media & PDF management tool. There are several other tools out there like Papers2, Mendeley, Zotero, even Endnote.)
If I were to extend this research I would look at PDFs from more content providers. I would look for a PDF from an Open Access repository like the Rutgers Optimality Archive, a dissertation from ProQuest, some content from a reputable archive like PARADISEC, and something from a DSpace implementation. (Xpdf can be used in conjunction with DSpace; in fact it is even mentioned in the manual.)