This post is an open draft! It might be updated at any time.
The online version of the SIL Bibliography contains a subset of over 29,000 citations from the more than 40,000 publications representing 75 years of SIL International's language research in over 2,700 languages.
Finding resources through SIL.org's Bibliography (as of 2 August 2012) can be a challenge at times, maybe even a time-wasting endeavor, because consulting the online Bibliography might not turn out to be very useful.
The challenge, which affects usefulness, is primarily threefold:
Items known by SIL to have been created by SIL staff may or may not be listed. (The online Bibliography is a subset.)
Items listed in the Bibliography may or may not have digitally accessible resources.
Items created by SIL staff may or may not be in the Bibliography because they have not been submitted to the Language and Culture Archive (the division that manages the SIL Bibliography).
On January 4-5, 2012, I had the opportunity to participate in the LSA's Satellite Workshop for Sociolinguistic Archival Preparation in Portland, Oregon. I learned a great many things there, so here are only a few thoughts.
Part of the discussion at the workshop was on how we can make corpora which are collected by sociolinguists available to the larger sociolinguistic community. In particular, the discussion I am referencing revolved around the standardization of metadata in the corpora. (In the discussion it was established that there are two levels of metadata, "event level" and "corpus level".) While OLAC gives us some standardization of the corpus level metadata, the event level metadata is still unique to each investigation, and arguably this is necessary. However, it was also pointed out that not all "event level" metadata needs to be encoded or tracked uniquely. That is, data like date of recording, names of participants, location of recording, and gender (male/female) of each participant can all be regularized across the community.
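To make the idea of regularized "event level" metadata concrete, here is a minimal sketch of what a shared record could look like. It is only an illustration: the class names, field names, and sample values are my own assumptions, not anything agreed on at the workshop or defined by OLAC. The regularized fields are typed, while anything unique to one investigation goes in a free-form mapping.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Participant:
    name: str
    gender: str  # "male" or "female", the two values named in the discussion


@dataclass
class EventMetadata:
    """Hypothetical event-level record; field names are assumptions."""
    recording_date: date
    location: str
    participants: list[Participant]
    # Investigation-specific metadata stays unregularized on purpose.
    project_specific: dict[str, str] = field(default_factory=dict)


# Placeholder values only; this does not describe a real recording.
event = EventMetadata(
    recording_date=date(2011, 6, 15),
    location="Example Town",
    participants=[Participant("Speaker A", "female")],
    project_specific={"elicitation_task": "picture description"},
)
```

Keeping the investigation-specific fields in a separate mapping means two projects can share the regular fields without having to agree on everything else.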
With the above as preface, it is important to realize that there are still various kinds of metadata which need to be collected. In the workshop it was acknowledged that the field of language documentation was about 10 years ahead of this community of sociolinguists. What was not well defined in the workshop was the distinction between a language documentation corpus and a sociolinguistics corpus. It seems to me, as a new practitioner, that the chief difference between these two types of corpora is how the researcher self-identifies. That is, does the researcher self-identify as a sociolinguist or as a language documenter? Both types of corpora attempt to get at the vernacular, and both types of corpora collect sociolinguistic facts. It would seem that both corpora are essentially the same (give or take a few metadata attributes). So, I will take an example from the metadata write-up I did for the Meꞌphaa language documentation project. In that project we collected metadata about:
Equipment settings during recording
In the following diagram I illustrate the cross cutting of a corpus with these "kinds" of metadata. The heavier, darker line represents the corpus, while the medium heavy lines represent the "kinds" of metadata. Finally, the lighter lines represent the sub-kinds of metadata, where the sub-kinds might be the latitude, longitude, altitude, datum, country, and place name of the location.
Corpora metadata categories with some sub-categories
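The kind and sub-kind relationship in the diagram can also be sketched in code. This is a minimal illustration under my own assumptions: the class name, field names, and the helper `sub_kinds` are hypothetical. Only the six location sub-kinds come from the description above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Location:
    """One "kind" of metadata; each field is a "sub-kind"."""
    latitude: float
    longitude: float
    altitude: Optional[float]  # may be unknown for some recordings
    datum: str
    country: str
    place_name: str


def sub_kinds(kind) -> list[str]:
    """List the sub-kinds (field names) that belong to a metadata kind."""
    return [f.name for f in fields(kind)]


print(sub_kinds(Location))
```

Other kinds, such as the equipment settings during recording, could be modeled the same way, each with its own set of sub-kinds.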
This does not mean that the corpus does not also need to be cross cut with these other "sub-kinds". However, these sub-kinds are significantly greater in number and will vary from project to project. Some of these metadata kinds will be collected in a speaker profile questionnaire, but some can only be provided with reflection on the event. To demonstrate the cross cutting of these metadata elements on a corpus I have provided the following diagram. It uses categories which were mentioned in the workshop and is not intended to be comprehensive. In this second diagram, the cross cutting elements might themselves be taxonomies. They may have controlled vocabularies, they may have an open set of possible values, or they may represent a scale.
Taxonomies for social demographics and social dynamics for speakers in corpora
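The three possibilities just mentioned (a controlled vocabulary, an open set of values, or a scale) can each be treated as a different rule for what counts as a valid value. The following is a rough sketch under my own assumptions: the class names, the `validate` helper, and the example fields "occupation" and "formality" are hypothetical and were not taken from the workshop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlledVocabulary:
    allowed: frozenset

    def accepts(self, value) -> bool:
        return value in self.allowed


@dataclass(frozen=True)
class OpenSet:
    def accepts(self, value) -> bool:
        # Any non-empty text is allowed.
        return isinstance(value, str) and value.strip() != ""


@dataclass(frozen=True)
class Scale:
    minimum: int
    maximum: int

    def accepts(self, value) -> bool:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        return is_int and self.minimum <= value <= self.maximum


# Hypothetical schema; only the gender values come from the discussion above.
SCHEMA = {
    "gender": ControlledVocabulary(frozenset({"male", "female"})),
    "occupation": OpenSet(),
    "formality": Scale(1, 5),
}


def validate(record: dict, schema: dict = SCHEMA) -> list[str]:
    """Return the names of fields whose values the schema rejects."""
    return [name for name, rule in schema.items()
            if not rule.accepts(record.get(name))]
```

For example, `validate({"gender": "female", "occupation": "farmer", "formality": 3})` returns an empty list, while a record with a formality of 9 would be reported as invalid for that field.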
Both of these diagrams tend to illustrate what in this workshop was referred to as "event level" metadata, rather than "corpus level" metadata.
A note on corpus level metadata vs. descriptive metadata
There is one more thing which I would like to say about "corpus level" metadata. Metadata is often separated out by function. That is, what does the metadata allow us to do, or why is the metadata there?
I have been exposed to the following taxonomy of metadata types through coursework and in working with photographs and images. These classes of metadata are also similar to those posted by JISC Digital Media as they approach issues with metadata for digital audio.
Descriptive metadata: supports discovery, attribution and identification of resources created.
Administrative metadata: supports management, preservation, and appropriate usage of resources created.
Technical: About the machinery used to create the resource and the technical aspects of the resource.
Use and Rights: Copyright, license and moral ownership of the items.
Structural metadata: maintains relationships between the parts of complex, multi-part resources (Spanne 2008).
Situational: this is metadata which describes the events around the creation of the work. Asking questions about the social setting, or the precursory events. It follows ideas put forward by Bergqvist (2007).
Use metadata: metadata collected from or about the users themselves (e.g. user annotations, number of people accessing a particular resource)
I think it is only fair to point out to archivists and librarians that linguists and language documenters do not see a difference between descriptive and non-descriptive metadata in their workflows. That is, sometimes we want to search all the corpora by license or by a technical attribute. This elevates these attributes to the function of discovery metadata. It does not remove descriptive metadata from its role in finding things, but it does mean that, functionally, the other metadata is also viable as discovery metadata.
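A small sketch shows what this means in practice: if every attribute on a record is searchable, a license or a technical attribute works for discovery just as well as a title does. The record fields, the sample values, and the `find` function are assumptions made for illustration and do not describe any real archive's interface.

```python
def find(records, **criteria):
    """Return records whose attributes match every given criterion."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


# Placeholder records mixing descriptive, rights, and technical metadata.
records = [
    {"title": "Recording 1", "license": "CC BY", "sample_rate_hz": 48000},
    {"title": "Recording 2", "license": "CC BY-NC", "sample_rate_hz": 44100},
    {"title": "Recording 3", "license": "CC BY", "sample_rate_hz": 44100},
]

by_license = find(records, license="CC BY")
by_technical = find(records, sample_rate_hz=44100)
by_both = find(records, license="CC BY", sample_rate_hz=44100)
```

Here the search does not care which functional class an attribute belongs to, which is exactly how the non-descriptive metadata ends up doing discovery work.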