This paper is motivated by an experience in collecting, analyzing, and then redeploying (sharing while making relevant to other corporate SIL functions) corporate intellectual assets. These assets are relevant both to SIL products and services and to corporate processes. This paper attempts to document some of the current challenges facing the SIL staff person and to present some items for consideration in overcoming these challenges.
The Ethnologue [M. Paul Lewis (ed.). 2009. Ethnologue: Languages of the World, 16th edn. Dallas, Tex.: SIL International.], as an academic book, is somewhat of a straw man in linguistics. Many people who write grants for language documentation projects (generally on underdescribed or endangered languages) will cite the Ethnologue and some other resources, or the lack of resources [Steven A. Marlett. 2011. Documenting the Me’phaa genus. DEL-NEH fellowship proposal. http://www.neh.gov/grants/guidelines/pdf/DEL_NEH_Marlett.pdf. [PDF] [DEL Awards] [Accessed: 15 February 2011]; Sadaf Munshi. 2011. Archive of Annotated Burushaski Texts. NSF grant proposal. http://www.neh.gov/grants/guidelines/pdf/DEL_NSF_Munshi.pdf. [PDF] [DEL Awards] [Accessed: 15 February 2011]; Monica A. Macaulay. 2011. Potawatomi Documentation, Lexical Database, and Dictionary. NEH grant proposal. http://www.neh.gov/grants/guidelines/pdf/DEL_NEH_Macaulay.pdf. [PDF] [DEL Awards]]. These funding efforts are usually an attempt to gather more language data. The rationale for this is twofold:
Because so little is known that we cannot tell whether the Ethnologue is correct.
Because there is a conflict between other published sources and the Ethnologue [Roger Blench. n.d. Introduction to the Temein languages. http://www.rogerblench.info/Language/Nilo-Saharan/Eastern%20Sudanic/Temein%20cluster/Blench%20Temein%20language%20NM%20proceedings.pdf [PDF] [Accessed: 15 February 2011]].
SEO for standard websites is pretty straightforward. I happen to be working on a website redesign (in Drupal) which presents linguistic resources, both published and unpublished. I recently came across two specialized SEO options which are useful:
On January 4-5, 2012, I had the opportunity to participate in the LSA's Satellite Workshop for Sociolinguistic Archival Preparation in Portland, Oregon. I learned a great many things there; here are only a few thoughts.
Part of the discussion at the workshop was on how we can make corpora which are collected by sociolinguists available to the larger sociolinguistic community. In particular, the discussion I am referencing revolved around the standardization of metadata in the corpora. (In the discussion it was established that there are two levels of metadata, "event level" and "corpus level".) While OLAC gives us some standardization of the corpus-level metadata, the event-level metadata is still unique to each investigation, and arguably this is necessary. However, it was also pointed out that not all "event level" metadata need to be encoded or tracked uniquely. That is, data like date of recording, names of participants, location of recording, and gender (male/female) of participants can all be regularized across the community.
With the above as preface, it is important to realize that there are still various kinds of metadata which need to be collected. In the workshop it was acknowledged that the field of language documentation is about 10 years ahead of this community of sociolinguists. What was not well defined in the workshop was the distinction between a language documentation corpus and a sociolinguistics corpus. It seems to me, as a new practitioner, that the chief difference between these two types of corpora is the self-identification of the researcher: does the researcher self-identify as a sociolinguist or as a language documenter? Both types of corpora attempt to get at the vernacular, and both collect sociolinguistic facts. It would seem that both corpora are essentially the same (give or take a few metadata attributes). So I will take an example from the metadata write-up I did for the Meꞌphaa language documentation project. In that project we collected metadata about:
Equipment settings during recording
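Fields like these, together with the regularizable event-level items mentioned above (date of recording, participants, gender, location), can be sketched as a simple record structure. This is only an illustrative Python sketch, not the project's actual schema; all field names and example values here are my own invention.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Location:
    # Sub-kinds of location metadata: coordinates plus human-readable place info
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    altitude: Optional[float] = None
    datum: str = "WGS84"
    country: str = ""
    place_name: str = ""

@dataclass
class RecordingEvent:
    # Event-level fields that can be regularized across a research community
    date: str = ""                      # ISO 8601 date of recording
    participants: List[str] = field(default_factory=list)
    participant_genders: List[str] = field(default_factory=list)
    location: Location = field(default_factory=Location)
    equipment_settings: dict = field(default_factory=dict)  # e.g. sample rate, mic model

# A hypothetical event record
event = RecordingEvent(
    date="2010-06-15",
    participants=["Speaker A"],
    participant_genders=["female"],
    location=Location(country="Mexico", place_name="(village name)"),
    equipment_settings={"sample_rate_hz": 48000, "bit_depth": 24},
)
```

Regularizing the outer fields while leaving `equipment_settings` open-ended mirrors the workshop's point: some event metadata can be standardized community-wide, the rest stays project-specific.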
In the following diagram I illustrate the cross-cutting of a corpus with these "kinds" of metadata. The heavier, darker line represents the corpus, the medium lines represent the "kinds" of metadata, and the lighter lines represent the sub-kinds of metadata, where the sub-kinds might be the latitude, longitude, altitude, datum, country, and place name of the location.
Corpora metadata categories with some sub-categories
This does not mean that the corpus does not also need to be cross-cut with these other "sub-kinds". However, these sub-kinds are significantly greater in number and will vary from project to project. Some of these metadata kinds will be collected in a speaker profile questionnaire, but some can only be provided with reflection on the event. To demonstrate the cross-cutting of these metadata elements on a corpus I have provided the following diagram. It uses categories which were mentioned in the workshop and is not intended to be comprehensive. In this second diagram, the cross-cutting elements might themselves be taxonomies: they may have controlled vocabularies, they may have an open set of possible values, or they may represent a scale.
Taxonomies for social demographics and social dynamics for speakers in corpora
Both of these diagrams illustrate what in this workshop was referred to as "event level" metadata, rather than "corpus level" metadata.
A note on corpus-level metadata vs. descriptive metadata
There is one more thing I would like to say about "corpus level" metadata. Metadata is often separated out by function: what does the metadata allow us to do, and why is it there?
I have been exposed to the following taxonomy of metadata types through course work and in working with photographs and images. [Photometadata.org. 2011. Classes Of Metadata. http://www.photometadata.org/node/46. [Link] [Accessed: 18 January 2012]] These classes of metadata are also similar to those posted by JISC Digital Media as they approach issues with metadata for digital audio. [JISC Digital Media. 07 January 2010. Metadata and Audio Resources. http://www.jiscdigitalmedia.ac.uk/audio/advice/metadata-and-audio-resources [Link] [Accessed: 19 March 2012]]
Descriptive metadata: supports discovery, attribution, and identification of resources created.
Administrative metadata: supports management, preservation, and appropriate usage of resources created.
Technical metadata: describes the machinery used to create the resource and the technical aspects of the resource.
Use and rights metadata: copyright, license, and moral ownership of the items.
Structural metadata: maintains relationships between the parts of complex, multi-part resources (Spanne 2008). [Spanne, Joan. 2008. Metadata: Why, What and How (the "Who" is You). Presentation for Audio and Video Techniques. Dallas: GIAL. 29 July 2008.]
Situational metadata: describes the events around the creation of the work, asking questions about the social setting or the precursory events. It follows ideas put forward by Bergqvist (2007). [Bergqvist, Henrik. 2007. The role of metadata for translation and pragmatics in language documentation. In Peter K. Austin (ed.), Language Documentation and Description, vol. 4, 163-73. London: SOAS.]
Use metadata: metadata collected from or about the users themselves (e.g. user annotations, number of people accessing a particular resource). [JISC Digital Media. 07 January 2010. An Introduction to Metadata. http://www.jiscdigitalmedia.ac.uk/crossmedia/advice/an-introduction-to-metadata/ [Link] [Accessed: 19 March 2012]]
I think it is only fair to point out to archivists and librarians that linguists and language documenters do not see a difference between descriptive and non-descriptive metadata in their workflows. That is, sometimes we want to search all the corpora by license or by a technical attribute. This elevates these attributes to the function of discovery metadata. It does not remove descriptive metadata from its role in finding things, but it does functionally mean that the other metadata is also viable as discovery metadata.
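To make the point concrete, here is a minimal Python sketch (the records and attribute names are invented, not drawn from any real archive catalog) showing how rights metadata and technical metadata can serve the discovery function just as descriptive metadata does.

```python
# A toy catalog: each archived item carries metadata of several functional
# classes (descriptive title, rights license, technical sample rate).
items = [
    {"title": "Narrative recording 01", "license": "CC BY-NC", "sample_rate_hz": 48000},
    {"title": "Wordlist elicitation",   "license": "CC BY",    "sample_rate_hz": 44100},
    {"title": "Interview 03",           "license": "CC BY",    "sample_rate_hz": 96000},
]

def discover(items, **criteria):
    """Treat any metadata attribute -- rights, technical, or descriptive --
    as discovery metadata, by filtering on it directly."""
    return [i for i in items if all(i.get(k) == v for k, v in criteria.items())]

print(discover(items, license="CC BY"))        # rights metadata used for discovery
print(discover(items, sample_rate_hz=48000))   # technical metadata used for discovery
```

The design point is simply that discovery is a function applied over attributes, not a property of one privileged class of attributes.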
While I was in Malaysia, I had the honor to meet and talk quite a bit with Professor Emeritus Howard McKaughan. We talked about his linguistics work in Mexico, the Philippines, and Malaysia. He can tell stories, interesting stories.
Howard - Story Telling
There is something unique about his generation of Americans (currently in their 80s and 90s): their ability to craft and tell stories. I feel that this is a cultural point I don't have. It could be because I am third culture, or because I talk too much about the macro-details, or it might simply be because I am long-winded.
In October, Becky and I were invited to present FLEx at Universiti Malaysia Sabah as part of a workshop for compiling native dictionaries and managing cultural data. I learned a lot about dictionaries, about using FLEx to organize dictionary data, about Webonary, and about Malaysia.
One of the things this workshop helped me to clearly articulate was that there are four knowledge content areas which dictionary creators need:
Knowledge about Theoretical Linguistics to understand the language being described and the categories possible in the dictionary.
Knowledge about the language being analyzed and described so that they can apply the appropriate options available to this situation.
Knowledge about how to manage the editorial process for the dictionary (including entry submission).
Knowledge about how to use the software to implement the editorial process.
This workshop’s focus was only on the software used to implement the editorial process (mostly the data collection part). So in some ways it felt like we weren't giving the participants all the tools they will need (or even showing them all the tools they will need). But we had to realize that it is not our responsibility to give them all the tools they need or to expose them to all of these issues; they need local contacts for that. Regardless of these issues, we were still ecstatic that there were about 80 people in attendance.
About 80 people
Opening ceremonies at UMS
Becky took most of the sessions on FLEx. She presented on using FLEx as a tool for collecting words and various things about words. We covered several input methods and features in the application.
Becky talking about FLEx as a tool
Becky helping people doing exercises
I presented a session on how to get data out of FLEx. We talked about putting dictionary data on the web and turning it into .epub files.
Hugh presenting on getting things out of FLEx
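For readers curious what "getting data out of FLEx" can look like programmatically: FLEx can export its lexicon as LIFT XML, and a tiny script can turn entries into HTML on the way to the web or an .epub. This is a hedged sketch; the element names follow the LIFT format, but the fragment, language code, and function are my own inventions, and a real export is far richer.

```python
import xml.etree.ElementTree as ET

# A toy fragment in the shape of a LIFT export (real exports carry many more fields).
lift_xml = """
<lift version="0.13">
  <entry id="e1">
    <lexical-unit><form lang="xx"><text>example-headword</text></form></lexical-unit>
    <sense><gloss lang="en"><text>an example gloss</text></gloss></sense>
  </entry>
</lift>
"""

def lift_to_html(xml_text):
    """Render each LIFT entry as a one-line HTML paragraph: headword + gloss."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        headword = entry.findtext("lexical-unit/form/text", default="")
        gloss = entry.findtext("sense/gloss/text", default="")
        rows.append(f"<p><b>{headword}</b> {gloss}</p>")
    return "\n".join(rows)

print(lift_to_html(lift_xml))
```

From HTML like this, packaging into an .epub is mostly a matter of zipping the pages with the required manifest files.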
I think one of the more interesting things that I learned was about expectations, culture and photographs.
Many people wanted photographs with us (or of us). This is not totally unexpected. What was unexpected was that rather than taking one photo and sharing it (passing it around), everyone wanted their own picture. Not just their own picture with us, but a picture with us made with their own camera! It was in that moment that I had an epiphany. Having training in language documentation, I am aware of and concerned with rules and laws concerning privacy. In the U.S., when dealing with issues of informed consent and intellectual property, it cannot be assumed that if I want to take a picture of you, then I, the owner of the camera, own the picture. Furthermore, it cannot be assumed that I have the right to do with that picture as I please, e.g. post it to the internet. This may be in part because our laws are based on our semantics; it may be in part our culture. But there I realized that, in this context, if the photo is taken with your camera, you own the photo and can do with it as you please. Asking for permission is simply asking for permission to take the photo.
Taking our picture
Taking their picture, while they were taking a picture of us. Since he who owns the camera, owns the picture...
I took this last picture at about the same time I had the epiphany.
An archival version of an audio file is a file which represents the original sound faithfully. In archiving we want to keep a version of the audio which can be used to make other products and can also be used directly itself if needed. This is usually done through PCM (pulse-code modulation). There are several file types which are associated with PCM or raw, uncompressed, faithful (to the original signal) digital audio. These are:
Broadcast Wave Format (BWF)

One way to understand the difference between audio file formats is understanding how different formats are used. One place which has been helpful to me has been the DOBBIN website, as they explain their software and how it can change audio from one PCM-based format to another.
Each one of these file types has the flexibility to have various kinds of components; i.e. several channels of audio can be in the same file, or one can have .wav files with different bit depths or sampling rates. But they are each an archive-friendly format. Before one says that a file is suitable for archiving simply based on its file format, one must also consider things like sample rate, bit depth, embedded metadata, channels in the file, etc. I was introduced to DOBBIN as an application resource for audio archivists by a presentation by Rob Poretti. [Rob Poretti. 2011. Audio Analysis and Processing in Multi-Media File Formats. ARSC 2011. http://www.arsc-audio.org/conference/audio2011/extra/48-Poretti.pptx [Link] [Accessed: 24 October 2011]] One additional thing worth noting in terms of archival versions of digital audio pertains to born-digital materials. Sometimes audio is recorded directly to a lossy compressed audio format. It would be entirely appropriate to archive such a born-digital file type based on its content, though it should be noted that ideally the recording would have been made in a PCM file format in the first place.
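The point that suitability for archiving depends on more than the file extension can be illustrated with Python's standard-library `wave` module, which exposes exactly the properties an archivist should check (channels, bit depth, sample rate). A small sketch:

```python
import wave

def inspect_wav(path):
    """Report the properties to check before accepting a WAV file into an archive."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bit_depth": w.getsampwidth() * 8,
            "sample_rate_hz": w.getframerate(),
            "duration_s": w.getnframes() / w.getframerate(),
        }

# Write half a second of silence as 16-bit / 44.1 kHz stereo, then inspect it.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)            # 2 bytes per sample = 16-bit
    w.setframerate(44100)
    w.writeframes(b"\x00\x00\x00\x00" * 22050)   # 22050 stereo frames

print(inspect_wav("example.wav"))
```

Two files can both end in `.wav` and still differ on every one of these properties, which is why an archive cannot accept "it's a WAV" as sufficient.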
What is a presentation version? (of an audio file)
A presentation version is a file created with a content use in mind. There are several general characteristics of this kind of file:
It is one that does not retain the whole PCM content.
It is usually designed for a specific application (use on a portable device, or in a personal audio player).
It can be thought of as a derivative product from an original audio or video stream.
In terms of file formats, there is not just one file format which is a presentation format. There are many formats. This is because there are many ways to use audio. For instance there are special audio file types optimized for various kinds of applications like:
3G and WiFi Audio and A/V services
Internet audio for streaming and download
Digital Satellite and Cable
Portable players

A brief look at an explanation by Cube-Tec might help to get the gears moving. It is part of the inspiration for this post.
This means there is a long list of potential audio formats for the presentation form.
Amiga IFF/SVX8/SV16 (iff)
Audio Visual Research (avr)
CDXA, like Video-CD (dat)
Ensoniq PARIS (paf)
FastTracker2 Extended (xi)
Midi Sample dump Format (sds)
Monkey’s Audio (ape/mac)
Mpeg 1&2 container (mpeg/mpg/vob)
Mpeg 4 container (mp4)
Mpeg audio specific (mp2/mp3)
Mpeg video specific (mpgv/mpv/m1v/m2v)
Portable Voice format (pvf)
Sound Designer 2 (sd2)
Windows Media (asf/wma/wmv)
Aside from just the file format difference in media files (.wav vs. .mp3) there are three other differences to be aware of:
Media stream quality variations
Media container formats
Possibilities with embedded metadata
Media stream quality variations
Within the same file type there might be variation in the quality of the audio. For instance, MP3 files can have variable bit rate encoding or constant bit rate encoding, and a constant bit rate can be high or low. WAV files can likewise have a high or a low bit depth and a high or a low sample rate. Some file types can have more channels than others: AAC files can have up to 48 channels, whereas MP3 files can only have up to 5.1 channels. [Various Contributors. 21 October 2011. Wikipedia: Advanced Audio Coding, AAC's improvements over MP3. http://en.wikipedia.org/wiki/Advanced_Audio_Coding#AAC.27s_improvements_over_MP3 [Link]]
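These quality variations translate directly into data rates, which is worth a quick back-of-the-envelope calculation (a simple sketch):

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM data rate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

# CD-quality WAV: 44.1 kHz, 16-bit, stereo
print(pcm_bit_rate(44100, 16, 2))   # 1411.2 kbps -- versus e.g. a 128 kbps constant-rate MP3

# A high-resolution field recording: 96 kHz, 24-bit, stereo
print(pcm_bit_rate(96000, 24, 2))   # 4608.0 kbps
```

The gap between 1411.2 kbps and a typical 128 kbps MP3 is exactly the information the lossy encoder throws away, which is why the presentation version cannot stand in for the archival version.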
One argument I have heard in favor of saving disk space is to use lossless compression rather than WAV files for archive-quality (and archive-version) recordings. As far as archiving is concerned, these lossless compression formats are still product-oriented file formats. One thing to realize is that not every file format can hold the same kind of audio. Some formats have limits on the bit depth of the samples they can contain, or a limit on the number of audio channels they can have in a file. This is demonstrated in the table below, taken from Wikipedia. [Various Contributors. 21 October 2011. Wikipedia: Comparison of audio formats, Technical Details of Lossless Audio Compression Formats. http://en.wikipedia.org/wiki/Comparison_of_audio_codecs#Technical_Details_of_Lossless_Audio_Compression_Formats [Link]] This is where understanding the relationship between a file format, a file extension, and a media container format is really important.
Media container formats can look like file types, but they really are containers of file types (think of a folder with an extension). Often they allow for the bundling of audio and video files with metadata, and then enable this set of data to act like a single file. On Wikipedia there is a really nice comparison of container formats.
MP4 is one such container format. Apple Lossless data is stored within an MP4 container with the filename extension .m4a; this extension is also used by Apple for AAC audio data in an MP4 container (same container, different audio encoding). However, Apple Lossless is not a variant of AAC (which is a lossy format), but rather a distinct lossless format that uses linear prediction similar to other lossless codecs such as FLAC and Shorten. [Various Contributors. 6 October 2011. Wikipedia: Apple Lossless. http://en.wikipedia.org/wiki/Apple_Lossless [Link]] Files with an .m4a extension generally do not have a video stream, even though MP4 containers can also hold a video stream.
MP4 can contain:
Video: MPEG-4 Part 10 (H.264) and MPEG-4 Part 2
Other compression formats are less used: MPEG-2 and MPEG-1
Audio: Advanced Audio Coding (AAC)
Also MPEG-4 Part 3 audio objects, such as Audio Lossless Coding (ALS), Scalable Lossless Coding (SLS), MP3, MPEG-1 Audio Layer II (MP2), MPEG-1 Audio Layer I (MP1), CELP, HVXC (speech), TwinVQ, Text To Speech Interface (TTSI) and Structured Audio Orchestra Language (SAOL)
Other compression formats are less used: Apple Lossless
Subtitles: MPEG-4 Timed Text (also known as 3GPP Timed Text).
Nero Digital uses DVD Video subtitles in MP4 files. [Various Contributors. 11 October 2011. Wikipedia: MPEG-4 Part 14. http://en.wikipedia.org/wiki/.m4a [Link]]
This means that MP3 audio can be contained inside of an .mp4 file, and that audio files are not always what they seem to be on the surface. This is why I advocate that an archive of digital files (one which archives for a digital publishing house) also use technical metadata as discovery metadata. File type alone is not enough to know what a file is.
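Because the extension does not reliably identify the content, a file's opening bytes are a better first clue. This rough Python sketch (not exhaustive, and not a substitute for a real identification tool) distinguishes an MP4-family container from RIFF/WAVE and from an MP3 stream by their magic numbers:

```python
def sniff_container(first_bytes):
    """Guess a media container from its opening bytes (a rough sketch).

    MP4-family files carry an 'ftyp' box at offset 4; RIFF/WAVE files start
    with 'RIFF'...'WAVE'; MP3 files typically start with an 'ID3' tag or an
    0xFFEx frame-sync pattern.
    """
    if first_bytes[4:8] == b"ftyp":
        return "MP4 container (could hold AAC, ALAC, MP3, video, ...)"
    if first_bytes[:4] == b"RIFF" and first_bytes[8:12] == b"WAVE":
        return "RIFF/WAVE"
    if first_bytes[:3] == b"ID3" or (
        len(first_bytes) >= 2
        and first_bytes[0] == 0xFF
        and (first_bytes[1] & 0xE0) == 0xE0
    ):
        return "MP3 stream"
    return "unknown"

print(sniff_container(b"\x00\x00\x00\x18ftypM4A \x00\x00"))   # MP4 container
print(sniff_container(b"RIFF\x24\x08\x00\x00WAVE"))           # RIFF/WAVE
print(sniff_container(b"ID3\x04\x00\x00\x00\x00\x00\x00"))    # MP3 stream
```

Note that the MP4 branch can only say "container": which codec sits inside requires parsing further boxes, which is precisely the point about technical metadata.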
Possibilities with embedded metadata
Audio files also vary greatly in what kinds of embedded metadata and metadata formats they support. MPEG-7, BWF, and MP4 all support embedded metadata. But this does not mean that audio players in the consumer or prosumer market respect this embedded metadata. ARSC has an interesting report on the support for embedded metadata in audio recording software. [Chris Lacinak, Walter Forsberg. 2011. A Study of Embedded Metadata Support in Audio Recording Software: Summary of Findings and Conclusion. ARSC Technical Committee. http://www.arsc-audio.org/pdf/ARSC_TC_MD_Study.pdf [Link]] Aside from this disregard for embedded metadata, there are various metadata formats which are embedded in different file types; one common type, ID3, is popular with .mp3 files. But even ID3 comes in different versions.
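As an illustration of one embedded-metadata format, the ID3v2 tag at the front of many .mp3 files begins with a 10-byte header. The layout below follows the ID3v2 specification ('ID3', two version bytes, a flags byte, then a 4-byte syncsafe size), though the sample bytes here are synthetic:

```python
def parse_id3v2_header(data):
    """Parse the 10-byte ID3v2 header found at the start of many .mp3 files."""
    if data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size = 0
    for b in data[6:10]:
        size = (size << 7) | (b & 0x7F)   # syncsafe integer: 7 bits per byte
    return {"version": f"2.{major}.{revision}", "flags": flags, "tag_size": size}

# A minimal synthetic header: ID3v2.3.0, no flags, tag size 257 bytes
header = b"ID3\x03\x00\x00\x00\x00\x02\x01"
print(parse_id3v2_header(header))
```

The version byte is why "even ID3 comes in different versions" matters in practice: an ID3v2.2 frame layout differs from v2.3/v2.4, so software that only understands one version may silently ignore the others.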
In archiving language and culture materials, our complete package often includes audio but rarely is just audio. However, understanding the audio components of the complete package helps us understand what it needs to look like in the archive. In my experience working with the Language and Culture Archive, most contributors are not aware of the difference between archival and presentation versions of audio formats, and those who think they are generally are not aware of the differences in codecs used (sometimes with the same file extension). From the archive's perspective this is a continual point of user/submitter education. This past week I have taken the time to listen to a few presentations by audio archivists from the 2011 ARSC convention. These show that the kinds of issues I have been dealing with in the Language and Culture Archive are not unique to our context.
Various Contributors. 21 October 2011. Wikipedia: Comparison of audio formats, Technical Details of Lossless Audio Compression Formats. http://en.wikipedia.org/wiki/Comparison_of_audio_codecs#Technical_Details_of_Lossless_Audio_Compression_Formats [Link]
Chris Lacinak, Walter Forsberg. 2011. A Study of Embedded Metadata Support in Audio Recording Software: Summary of Findings and Conclusion. ARSC Technical Committee. http://www.arsc-audio.org/pdf/ARSC_TC_MD_Study.pdf [Link]
I have recently been reading the blog of Martin Fenner and came upon the article Personal names around the world. [Martin Fenner. 14 August 2011. Personal names around the world. PLoS Blog Network. http://blogs.plos.org/mfenner/2011/08/14/personal-names-around-the-world [Link] [Accessed: 16 September 2011]] His post is in fact a reflection on a W3C paper of the same title, Personal names around the world. (Several other reflections are here: http://www.w3.org/International/wiki/Personal_names.) This is apparently coming out of the i18n effort and is meant to help authors and database designers make informed decisions about names on the web.
I read Martin's post with some interest because in language documentation, getting someone's name as a source or for informed consent is very important (from a U.S. context). Working in an archive dealing with language materials, I see a lot of names. One of the interesting situations which came to me, from an Ecuadorian context, was different from what I have seen in the w3.org paper or in the w3.org discussion. The naming convention went like this:
The elder was known by the younger’s name plus a relationship.
My suspicion is that it is taboo to name the dead. So, to avoid possibly naming the dead, the younger person was referenced and the relationship was invoked. This affected me in the archive, as I am supposed to note who the speaker is on the recordings. In lieu of the speaker's name, I have the young son's first name (he is well known in the community and is in his 30s or so), and I have the relationship. In English this might sound like "John's mother". Now what am I supposed to put in the metadata record for the audio recordings I am cataloging? I do not have a name, but I do have a relationship to a person known to the community.
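One way to catalog such a case is to let the participant field hold either a personal name or a relationship to a named anchor person. A hypothetical Python sketch (the names and field names are invented, not from the actual recordings):

```python
# A participant record that permits relational naming when no direct
# personal name is available (e.g. because of a naming taboo).
speaker = {
    "name": None,                    # no direct name available
    "referenced_person": "John",     # the well-known younger relative (hypothetical)
    "relationship": "mother of",     # so the speaker is "John's mother"
    "role": "speaker",
}

def display_name(record):
    """Prefer a direct name; otherwise render the relational reference."""
    if record["name"]:
        return record["name"]
    return f'{record["relationship"]} {record["referenced_person"]}'

print(display_name(speaker))   # mother of John
```

Keeping the relationship and the anchor person in separate fields (rather than a free-text string) would let the archive later link the record to the anchor person's own records.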
I inquired with a literacy consultant who has worked with indigenous people in Ecuador for some years. She informed me that in one context she worked in, everyone knew what family line they were from, and all the names were derived from that family line by position. It was such that to call someone by their name was an insult.
It sort of reminds me of this sketch by Fry and Laurie.
There is a myriad of difficulties in overlaying language data with geographical data. But it has been done and can be done. While I was working in México on a language documentation project, I learned that some of the language mixing (not quite diglossia, but rather two people groups with different languages living in the same spaces) was due to geographical and economic factors pulling them into the same locations. In the particular case I am thinking of, there was a mountain pass and a valley on the way to the major center of trade. In this sort of context the interesting things are displayed not when a polygon is drawn showing a territorial overlay of where various language speakers live, but when something is drawn showing the density or dispersion of those speakers relative to the general population. Some of the most detailed (in terms of global perspective) language maps can be found in the Ethnologue. [Lewis, M. Paul (ed.). 2009. Ethnologue: Languages of the World, 16th edn. Dallas, Tex.: SIL International.]
Western Central Mexico from the Ethnologue
However, as I was working on the language documentation project I found out how much effort actually goes into that sort of map. ArcGIS, the software used to create the maps, cannot auto-generate a polygon a certain distance around a combined set of given points. A set of points can be selected and each point given a 5-mile radius, which means that each polygon has to be hand drawn. The sort of graphical overlay used in the Ethnologue [Map of Languages in Western Mexico in the Ethnologue. http://www.ethnologue.com/show_map.asp?name=MX&seq=30 [Link] [Accessed: 9 September 2011]] does not show the density of speakers of a language in an area relative to the total population (in the Ethnologue's defense, I am not sure it is supposed to). For instance, if I wanted to know "What is the density of speakers in the Me’phaa area of México relative to speakers of other languages?", that would show me some dispersion and, by implication, the peopling of the area. This sort of geographical overlay may be closer to displaying social networks than bilingualism or diglossia. There might be some bilinguals or some average level of bilingualism there, but the heat-map method of plotting still looks at the density of speakers in an area. A similar map might be created of New York City, where certain languages are given a color based on their distribution density in the area. Additionally, these sorts of data overlays are probably more prone to lend insights on language attrition patterns or language speaker migration patterns. Also, these hand-drawn polygons change (a little) from edition to edition. Because the data used to create the polygons is not referenced (cited), it is hard to tell whether the changes are keeping pace with language attrition and/or population movement, or whether they are due to a better linguistic understanding of a particular area.
When looking at the large-area maps in the Ethnologue, [Map of Languages in the Americas in the Ethnologue. http://www.ethnologue.com/show_map.asp?name=Americas&seq=10 [Link] [Accessed: 9 September 2011]] it is hard to tell if the red dots represent the "traditional" language area (or the geographical center thereof) or the current geographical center of the speaking area. Either way, the plotting functions as if it were a heat map showing the diversity of languages over a geographical area.
Americas Map from the Ethnologue
I am generally on the lookout for web apps and APIs which can be used to overlay data, bringing new insights to situations through graphical representations. I recently found a tool for overlaying data on Google Maps: it creates heat maps given data from another source, and it is called gHeat. It was brought to my attention by Ben O'Steen, who modified gHeat to display prices for student properties in the UK. [Ben O'Steen. 2011. Student Property Heatmap. Random Hacks: Hacks, code and other things. http://benosteen.wordpress.com/2011/07/26/student-property-heatmap [Link] [Accessed: 2 September 2011]] My initial thought was: "Wow, how can we do language maps like this?"
Student Property Heat Map
Obviously I still think that language-based heat maps could give language workers worldwide access to visualizations of data that could really add clarity to the language vitality situation.
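As a proof of concept, the core of such a heat map is just binning point data into a grid: shading the counts (or the counts divided by the total population per cell) gives exactly the density view argued for above. A sketch with invented coordinates, assuming NumPy is available:

```python
import numpy as np

# Hypothetical survey points: (latitude, longitude) of recorded speakers of one language
speaker_points = np.array([
    [17.20, -98.74], [17.21, -98.75], [17.19, -98.73],   # a cluster near a valley
    [17.45, -98.90],                                     # an outlier settlement
])

# Bin the points into a coarse 5x5 grid; the count per cell is what a heat
# map shades. Dividing each cell by total population there would give the
# relative density discussed above, rather than a hand-drawn polygon.
density, lat_edges, lon_edges = np.histogram2d(
    speaker_points[:, 0], speaker_points[:, 1], bins=5
)
print(density.sum())   # every point lands in some cell
```

A renderer like gHeat (or any tile-drawing library) then only has to map the cell counts to colors, so the linguistic work reduces to collecting honest, citable point data.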