This post is an open draft! It might be updated at any time.
Metadata is very important, and on that everyone agrees. However, there is some discussion about how to develop metadata and how to ensure that it is accurate. Taxonomies are limited vocabularies (a fixed set of terms) where each term has a predefined definition. A folksonomy is a vocabulary where people, usually users of the data, assign their own useful words or metadata to an item. Folksonomies are like taxonomies in that both are sets of terms, but they differ in that folksonomies are open sets while taxonomies are closed sets.
An example of a taxonomy might be the colors of a traffic light: Red, Yellow, and Green. If this were a folksonomy, people might also suggest Amber, Orange, Blue-Green, and Blue. These additional terms may be accurate to some viewers of traffic lights, or in some cases, but they do not fit the stereotypical model of what the colors of traffic lights are.
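The closed/open distinction can be sketched in code. A minimal illustration (the names are mine, not from any standard): a taxonomy rejects terms outside its fixed vocabulary, while a folksonomy accepts any term and simply grows.

```python
# A taxonomy is a closed set: terms outside the vocabulary are rejected.
TRAFFIC_LIGHT_TAXONOMY = {"Red", "Yellow", "Green"}

def tag_with_taxonomy(term):
    if term not in TRAFFIC_LIGHT_TAXONOMY:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term

# A folksonomy is an open set: any user-supplied term is accepted and stored.
folksonomy = set(TRAFFIC_LIGHT_TAXONOMY)

def tag_with_folksonomy(term):
    folksonomy.add(term)  # the vocabulary grows with use
    return term

tag_with_folksonomy("Amber")  # accepted: the set simply grows
```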
If it does, then is it going to use WordPress to advertise things, or is it going to use WordPress to aggregate things? If the former, nothing out there has ever let the admin user choose which fields were matched to which attributes, dynamically. But then why would anyone actually want this functionality? What is the use case? If one is using WordPress as a bibliographic reference system, as some libraries do, then this makes a lot of sense. However, there is another use case I would like to present: the website which is about a single language or several languages. There are potentially two ways to conceptualize this:
January 4-5, 2012, I had the opportunity to participate in the LSA's Satellite Workshop for Sociolinguistic Archival Preparation in Portland, Oregon. There were a great many things I learned there. So here are only a few thoughts.
Part of the discussion at the workshop was on how we can make corpora which are collected by sociolinguists available to the larger sociolinguistic community. In particular, the discussion I am referencing revolved around the standardisation of metadata in the corpora. (In the discussion it was established that there are two levels of metadata, "event level" and "corpus level".) While OLAC gives us some standardization of the corpus level metadata, the event-level metadata is still unique to each investigation, and arguably this is necessary. However, it was also pointed out that not all "event level" metadata need be encoded or tracked uniquely. That is, data like the date of recording, names of participants, location of recording, and gender (male/female) of each participant can all be regularized across the community.
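As a sketch of what such a regularized event-level record might contain, here is a minimal Python structure (the field names are my own illustration, not a community standard):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RecordingEvent:
    """Event-level metadata that could be regularized across the community."""
    recording_date: date
    location: str
    participants: list                           # participant names
    genders: dict = field(default_factory=dict)  # name -> "male" / "female"

event = RecordingEvent(
    recording_date=date(2012, 1, 4),
    location="Portland, Oregon",
    participants=["Speaker A"],
    genders={"Speaker A": "female"},
)
```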
With the above as preface, we do need to understand that there are still various kinds of metadata which need to be collected. In the workshop it was acknowledged that the field of language documentation is about 10 years ahead of this community of sociolinguists. What was not well defined in the workshop was the distinction between a language documentation corpus and a sociolinguistics corpus. It seems to me, as a new practitioner, that the chief difference between these two types of corpora is the self-identification of the researcher. That is, does the researcher self-identify as a sociolinguist or as a language documenter? Both types of corpora attempt to get at the vernacular, and both collect sociolinguistic facts. It would seem that both corpora are essentially the same (give or take a few metadata attributes). So, I will take an example from the metadata write-up I did for the Meꞌphaa language documentation project. In that project we collected metadata about:
Equipment settings during recording
In the following diagram I illustrate the cross cutting of a corpus with these "kinds" of metadata. The heavier, darker line represents the corpus, while the medium heavy lines represent the "kinds" of metadata. Finally, the lighter lines represent the sub-kinds of metadata, where the sub-kinds might be the latitude, longitude, altitude, datum, country, and place name of the location.
Corpora metadata categories with some sub-categories
This does not mean that the corpus does not also need to be cross cut with these other "sub-kinds". However, these sub-kinds are significantly more numerous and will vary from project to project. Some of these metadata kinds will be collected in a speaker profile questionnaire, but some can only be provided through reflection on the event. To demonstrate the cross cutting of these metadata elements on a corpus I have provided the following diagram. It uses categories which were mentioned in the workshop and is not intended to be comprehensive. In this second diagram, the cross-cutting elements might themselves be taxonomies: they may have controlled vocabularies, they may have an open set of possible values, or they may represent a scale.
Taxonomies for social demographics and social dynamics for speakers in corpora
Both of these diagrams tend to illustrate what in this workshop was referred to as "event level" metadata, rather than "corpus level" metadata.
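The kind/sub-kind layering in the diagrams could be represented as nested records. A minimal sketch using the location sub-kinds named above (keys and values are illustrative only):

```python
# Each "kind" of event-level metadata carries its own "sub-kinds"; location
# is the example above: latitude, longitude, altitude, datum, country, place name.
event_metadata = {
    "location": {
        "latitude": 17.25,   # illustrative values
        "longitude": -98.6,
        "altitude_m": 1200,
        "datum": "WGS84",
        "country": "Mexico",
        "place_name": "example village",
    },
    "recording": {
        "date": "2012-01-04",
        "equipment_settings": {"sample_rate_hz": 48000},
    },
}

location_subkinds = sorted(event_metadata["location"])
```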
A note on corpus level metadata vs. descriptive metadata
There is one more thing I would like to say about "corpus level" metadata. Metadata is often separated out by function. That is, what does the metadata allow us to do, or why is the metadata there?
I have been exposed to the following taxonomy of metadata types through course work and in working with photographs and images. These classes of metadata are also similar to those posted by JISC Digital Media as they approach issues with metadata for digital audio.
Descriptive metadata: supports discovery, attribution, and identification of resources created.
Administrative metadata: supports management, preservation, and appropriate usage of resources created.
Technical metadata: describes the machinery used to create the resource and the technical aspects of the resource.
Use and rights metadata: covers the copyright, license, and moral ownership of the items.
Structural metadata: maintains relationships between the parts of complex, multi-part resources (Spanne 2008).
Situational metadata: describes the events around the creation of the work, asking questions about the social setting or the precursory events; it follows ideas put forward by Bergqvist (2007).
Use metadata: metadata collected from or about the users themselves (e.g. user annotations, or the number of people accessing a particular resource).
I think it is only fair to point out to archivists and librarians that linguists and language documenters do not see a difference between descriptive and non-descriptive metadata in their workflows. That is, sometimes we want to search all the corpora by license or by a technical attribute. This elevates these attributes to the function of discovery metadata. It does not remove descriptive metadata from its role in finding things, but it does functionally mean that the other metadata is also viable as discovery metadata.
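In code, elevating an attribute to discovery metadata just means letting the search predicate range over every field rather than only the "descriptive" ones. A hypothetical sketch (the records and field names are invented for illustration):

```python
# Each record mixes descriptive, rights, and technical metadata in one place.
catalog = [
    {"title": "Wordlist recording", "license": "CC-BY", "sample_rate_hz": 48000},
    {"title": "Narrative text", "license": "CC-BY-NC", "sample_rate_hz": 44100},
]

def discover(records, **criteria):
    """Treat every metadata field as discovery metadata: filter on any of them."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

by_license = discover(catalog, license="CC-BY")    # search by a rights attribute
by_rate = discover(catalog, sample_rate_hz=44100)  # search by a technical attribute
```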
I have been following the Learning Resource Metadata Initiative (LRMI), a collaborative effort between Creative Commons and the Association of Educational Publishers (Creative Commons 2011), with some interest as I start to look at SIL.org and how potential services and resources offered through SIL.org might be merged with the larger world of well-described data.
SIL has a long tradition of providing linguistic training. With the digital revolution, it only seems right that these training resources would be described appropriately in the educational arena. It will be interesting to look at LRMI as it develops over the next few months. And then to think about applying it in the context of Drupal.
Creative Commons. 7 June 2011. Creative Commons & the Association of Educational Publishers to establish a common learning resources framework. http://creativecommons.org/weblog/entry/27603 . [Accessed: 27 November 2011] [Link]
I have recently been reading the blog of Martin Fenner and came upon the article Personal names around the world. His post is in fact a reflection on a W3C paper of the same title (several other reflections are collected at http://www.w3.org/International/wiki/Personal_names). The paper is apparently coming out of the i18n effort and is meant to help authors and database designers make informed decisions about names on the web.
I read Martin’s post with some interest because in language documentation, getting someone’s name as a source or for informed consent is very important (from a U.S. context). Working in an archive dealing with language materials, I see a lot of names. One of the interesting situations which came to me from an Ecuadorian context was different from anything I have seen in the w3.org paper or in the w3.org discussion. The naming convention went like this:
The elder was known by the younger’s name plus a relationship.
My suspicion is that it is taboo to name the dead. So, to avoid possibly naming the dead, the younger person was referenced and the relationship was invoked. This affected me in the archive because I am supposed to note who the speaker is on the recordings. In lieu of the speaker’s name, I have the young son’s first name (he is well known in the community and is in his 30’s or so) and I have the relationship. So in English this might sound like "John’s mother". Now what am I supposed to put in the metadata record for the audio recordings I am cataloging? I do not have a name, but I do have a relationship to a person known to the community.
I inquired with a literacy consultant who has worked with indigenous people in Ecuador for some years. She informed me that in one context she was working in, everyone knew what family line they were from, and all the names were derived from that family line by position. It was such that to call someone by their name was an insult.
It sort of reminds me of this sketch by Fry and Laurie.
Working in an archive, one can imagine that letting go of materials is a real challenge: it is hard to do because of policy, but also because of the emotional "pack-rat" nature of archivists. This is no less the case at the archive where I work. We were recently working through a set of items and getting rid of the duplicates. (Physical space has its price; and the work should soon be available via JASOR.) However, one of the items we were getting rid of was a journal issue on a people group/language. The journal has three articles; of these, only one was written by someone who worked for the same organization I am working for now. So the "employer" and owner-operator of the archive only has rights to one of the three works (rights by virtue of "work-for-hire" laws). We have the off-print, which is what we have rights to share, so we keep and share that. It all makes sense. However, what we keep is catalogued and inventoried. Our catalogue is shared with the world via OLAC. With this tool someone can search for a resource on a language, by language. It occurs to me that the other two articles on this people group/language will not show in the aggregated OLAC results. This is a shame, as they would be really helpful in many ways. I wish there were a grassroots, open-source, web-facilitated effort where various researchers could contribute metadata (citations) for articles so that they would be added to the OLAC search.
The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository, the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachment of the digital item to the proper bitstream. Totally awesome.
However, one of the challenges is that just because the metadata is curated, collected, and properly filed does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time); later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004 (James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of metadata standards).
So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata in the files before upload so that when the files arrive in DSpace that they will have the correct metadata.
The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so, as they are both brainchildren of the geeks at Adobe.) However, what is needed in this particular setup is a two-way street with a check in between: we would need to overwrite what was there with the data we want there.
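Whatever tool ends up doing the embedding, that "check in between" amounts to diffing the curated catalogue record against what is already embedded in the file, then overwriting the fields that disagree. A sketch of the logic (the field names are illustrative, not DSpace's actual elements):

```python
def reconcile(curated, embedded):
    """Compare curated catalogue metadata against metadata embedded in a file.

    Returns (corrections, merged): the fields that must be overwritten,
    and the metadata the file should carry after the fix.
    """
    corrections = {k: v for k, v in curated.items() if embedded.get(k) != v}
    merged = {**embedded, **curated}  # curated values win every conflict
    return corrections, merged

curated = {"author": "J. Smith", "title": "Wordlist recordings"}
embedded = {"author": "Microsoft-Word", "creator_tool": "Word 14.0"}
corrections, merged = reconcile(curated, embedded)
```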
Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, it does not mean that the submitters would know how or why to use it. This is not, however, a valid reason to leave the functionality out of a development project. All marketing aside, an archive does have a responsibility to consumers of its digital content that the content will be functional. Part of today’s "functional" is the interoperability of metadata. Consumers appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, PhotoShop, iTunes, Mendeley, Papers, etc.
Another thought comes to mind when one is dealing with large files (over 1 GB): there is a reason for making a "preview" version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see whether the file is worth downloading as a .wav? It seems that a metadata packager could also create such a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
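As a sketch of how such previews might be scripted: the snippet below only builds the command lines (ffmpeg for audio, ImageMagick's convert for images) without running them; the file names and bitrate are hypothetical.

```python
from pathlib import Path

def preview_command(source):
    """Build a command line that derives a small preview from a large archival file."""
    source = Path(source)
    if source.suffix == ".wav":
        # ffmpeg: transcode a large .wav to a low-bitrate .mp3 for quick assessment
        return ["ffmpeg", "-i", str(source), "-b:a", "64k",
                str(source.with_suffix(".mp3"))]
    # ImageMagick: shrink an image to a bounded preview JPEG
    return ["convert", str(source), "-resize", "800x800>",
            str(source.with_name(source.stem + "_preview.jpg"))]

audio_cmd = preview_command("session01.wav")
image_cmd = preview_command("plate04.tif")
```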
This problem has been addressed in the open-source library science world. In fact, a nice piece of software does exist: the Metadata Extraction Tool. It is not an end-all for this archive’s needs, but it is a solution for some needs of this type.