This week I have been outlining the types of data that linguists need to be able to use and relate to each other as they do Language Documentation and Linguistic Research. I have tried to express these relationships graphically, and also to show where some of the leading tools SIL International offers sit in the problem space.
A User Experience Look at Linguistic Archiving
In a recent paper Jeremy Nordmoe, a friend and colleague, states that:
Because most linguists archive documents infrequently, they will never be experts at doing so, nor will they be experts in the intricacies of metadata schemas.
My initial reply is:
You are d@#n right! and it is because archives are not sexy enough!
In a team framework, where several members of a research team need to share bibliographic data (for materials referenced) as well as the resources being referenced, there needs to be a central repository for sharing both kinds of data. This is true for small, geographically localized groups as well as for large distributed research teams. New researchers joining an existing team need to be able to “plug in” to existing foundational work on the project and to access both the bibliographic data and the resources those bibliographic details point to. My point here is to outline some of the current challenges involved in overcoming this collaborative obstacle when working in the fields of Linguistics and Language Documentation. This sentiment is echoed by many in the world of science; here is someone on Zotero’s forums [INSERT LINK]. (Though Zotero does claim to combat some of these issues.)
Bibliographic Data vs. Citation Data
This post is an open draft! It might be updated at any time… but was last updated on December 19, 2014 at 1:10 am.
A pre-print draft will not be available through this means, though there is a video of the presentation.
A. Meꞌphaa Text Sample
A̱ ngui̱nꞌ, tsáanꞌ ninimba̱ꞌlaꞌ ju̱ya̱á Jesús, ga̱ju̱ma̱ꞌlaꞌ rí phú gagi juwalaꞌ ído̱ rí nanújngalaꞌ awúun mbaꞌa inii gajmá. Numuu ndu̱ya̱á málaꞌ rí ído̱ rí na̱ꞌnga̱ꞌlaꞌ inuu gajmá, nasngájma ne̱ rí gakon rí jañii a̱kia̱nꞌlaꞌ ju̱ya̱á Ana̱ꞌlóꞌ, jamí naꞌne ne̱ rí ma̱wajún gúkuálaꞌ. I̱ndo̱ó máꞌ gíꞌmaa rí ma̱wajún gúkuálaꞌ xúgíí mbiꞌi, kajngó ma̱jráanꞌlaꞌ jamí ma̱ꞌne rí jañii a̱kia̱nꞌlaꞌ, asndo rí náxáꞌyóo nitháan rí jaꞌyoo ma̱nindxa̱ꞌlaꞌ. [I̱yi̱i̱ꞌ rí niꞌtháán Santiágo̱ 1:2-4]
B. Sochiapam Chinantec Text Sample
Hnoh² reh², ma³hiún¹³ hnoh² honh² lɨ³ua³ cáun² hi³ quiunh³² náh², quí¹ la³ cun³ hi³ má²ca³lɨ³ ñíh¹ hnoh² jáun² hi³ tɨ³ jlánh¹ bíh¹ re² lı̵́²tɨn² tsú² hi³ jmu³ juenh² tsı̵́³, nı̵́¹juáh³ zia³² hi³ cá² lau²³ ca³tɨ²¹ hi³ taunh³² tsú² jáun² ta²¹. Hi³ jáun² né³, chá¹ hnoh² cáun² honh², hi³ jáun² lı̵́¹³ lɨ³tɨn² hnoh² re² hi³ jmúh¹³ náh² juenh² honh², hi³ jáun² hnoh² lı̵́¹³ lı̵́n³ náh² tsá² má²hún¹ tsı̵́³, tsá² má²ca³hiá² ca³táunh³ ca³la³ tán¹ hián² cu³tí³, la³ cun³ tsá² tiá² hi³ lɨ³hniauh²³ hí¹ cáun² ñí¹con² yáh³. [Jacobo Jmu² Cáun² Sí² Hi³ Ca³tɨn¹ Tsá² *Judíos, Tsá² Má²tiáunh¹ Ñí¹ Hliáun³ 1:2-4]
C. Spanish Text Sample
Hermanos míos, gozaos profundamente cuando os halléis en diversas pruebas, sabiendo que la prueba de vuestra fe produce paciencia. Pero tenga la paciencia su obra completa, para que seáis perfectos y cabales, sin que os falte cosa alguna. [Santiago 1:2-4 Reina-Valera 1995 (RVR1995)]
D. English Text Sample
Dear brothers and sisters, when troubles come your way, consider it an opportunity for great joy. For you know that when your faith is tested, your endurance has a chance to grow. So let it grow, for when your endurance is fully developed, you will be perfect and complete, needing nothing. [James 1:2-4 New Living Translation (NLT 2007)]
I have been doing some thinking about what would make OLAC search more valuable to its current and targeted users. One thing that would make it more useful would be if the NSF, a partial funder of OLAC and OLAC search, aggregated its language-related grants, scholarships, fellowships, and awards through OLAC.
Some of these grant proposals are really well-written, well-cited documents which explain a certain snapshot of a language situation. Even the announcement that a grant like From Endangered Language Documentation to Phonetic Documentation has been awarded would let other researchers know that someone has applied for, or been awarded, a block of funding to work on a particular language situation.
I was particularly happy to find that the NSF does have a search section for grant offerings and awarded grants. But aggregating this knowledge with prior research would really give parties interested in particular languages an integrated perspective.
The Ethnologue, as an academic book, is somewhat of a straw man in linguistics. Many people who write grants for language documentation projects (generally on under-described or endangered languages) will cite the Ethnologue and some other resources, or note the lack of resources. These funding efforts are usually aimed at getting more language data. The rationale for this is twofold:
- Because so little is known, we do not know if the Ethnologue is correct.
- Because there is a conflict between other published sources and the Ethnologue.
While I was in Malaysia, I had the honor of meeting and talking quite a bit with Professor Emeritus Howard McKaughan. We talked about his linguistics-based work in Mexico, the Philippines, and Malaysia. He can tell stories, interesting stories.
There is something unique about his generation of Americans (currently in their 80s and 90s): their ability to craft and tell stories. I feel that this is a cultural skill I don’t have. It could be because I am third culture, or because I talk too much about the macro-details, or it might simply be because I am long-winded.
There is a myriad of difficulties in overlaying language data with geographical data. But it has been done and can be done. While I was working in México on a language documentation project, I learned that some of the language mixing (not quite diglossia, but rather two people groups with different languages living in the same spaces) was due to geographical and economic factors pulling them into the same locations. In the particular case I am thinking of, there was a mountain pass and a valley on the way to the major center of trade. In this sort of context the interesting things are displayed not when a polygon is drawn showing a territorial overlay of where various language speakers live, but when something is drawn showing density or population dispersion relative to the general population. Some of the most detailed (in terms of global perspective) language maps can be found in the Ethnologue.
I am generally on the lookout for web apps and APIs which can be used to overlay data and bring new insights to situations through graphical representations. I recently found a tool for overlaying data on Google Maps. This tool, called gHeat, creates heat maps from data in another source. It was brought to my attention by Ben O’Steen, who modified gHeat to display prices for student properties in the UK. My initial thought was: “Wow, how can we do language maps like this?” Obviously, I still think that language-based heat maps could give language workers worldwide access to visualizations of data that could really add clarity to the language vitality situation.
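As a sketch of what sits behind any such heat map: before tiles are rendered, point data has to be binned into a density grid. Here is a minimal, illustrative Python sketch; the survey coordinates are invented, and gHeat’s actual input format is not assumed here.

```python
from collections import Counter

def bin_points(points, cell=1.0):
    """Bin (lat, lon) points into grid cells of `cell` degrees.

    Returns a Counter mapping (row, col) cells to point counts --
    the raw ingredient a heat-map renderer would shade by density.
    """
    counts = Counter()
    for lat, lon in points:
        # Floor-divide so each point falls into exactly one cell.
        counts[(int(lat // cell), int(lon // cell))] += 1
    return counts

# Hypothetical survey points: household locations of speakers of
# two languages sharing the same valley near a trade route.
speakers = [(17.1, -96.7), (17.2, -96.7), (17.1, -96.8), (18.4, -97.1)]
density = bin_points(speakers, cell=1.0)
```

Shading each cell by its count (instead of drawing one polygon per language territory) is exactly the density-over-territory contrast described above.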
The company I work for has an archive for many kinds of materials. Recently, the company has started a digital repository using DSpace. To facilitate contributions to the repository, the company has built an Adobe AIR app which allows metadata to be uploaded to the metadata elements of DSpace and the digital item itself to be attached to the proper bitstream. Totally awesome.
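For context on what a submission tool ultimately has to produce: DSpace’s Simple Archive Format describes an item’s metadata in a small dublin_core.xml file. A minimal sketch of generating one with the Python standard library; the field values and author name are illustrative, and this does not claim to be the AIR app’s actual mechanism.

```python
import xml.etree.ElementTree as ET

def dublin_core_xml(fields):
    """Serialize {(element, qualifier): value} pairs into the
    dublin_core.xml layout used by DSpace's Simple Archive Format."""
    root = ET.Element("dublin_core")
    for (element, qualifier), value in fields.items():
        dcv = ET.SubElement(root, "dcvalue",
                            element=element, qualifier=qualifier)
        dcv.text = value
    return ET.tostring(root, encoding="unicode")

# Illustrative record; names and values are hypothetical.
record = {
    ("title", "none"): "A Me'phaa Text Collection",
    ("contributor", "author"): "Jane Fieldworker",
    ("language", "iso"): "tcf",
}
xml_out = dublin_core_xml(record)
```

The same dictionary-of-fields shape is also what a packager would need to hold on to if it later wants to push those values into the files themselves.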
However, one of the challenges is that just because the metadata is curated, collected, and properly filed does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time); later, when a consumer downloads the digital file, only the wrong metadata travels with it. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg, and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004: James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of metadata standards.
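One could at least flag the worst offenders at submission time. A deliberately naive Python sketch for auditing a PDF’s embedded author; it only scans the raw bytes for an uncompressed /Author entry, so it misses encrypted, compressed, or UTF-16-encoded info dictionaries, and a real tool should do the actual checking. The sample bytes below are a fabricated stub, not a real PDF.

```python
import re

def naive_pdf_author(pdf_bytes):
    """Naively scan a PDF's raw bytes for an /Author entry in the
    document information dictionary.  A quick audit heuristic only:
    it handles just plain, uncompressed literal strings.
    """
    match = re.search(rb"/Author\s*\(([^)]*)\)", pdf_bytes)
    return match.group(1).decode("latin-1") if match else None

# A stub of a PDF info dictionary, for illustration only.
sample = b"... /Title (Field Notes) /Author (Microsoft Word) ..."
```

A check like `naive_pdf_author(data) == "Microsoft Word"` would be enough to warn a submitter that the embedded metadata disagrees with the curated record.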
So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to embed the correct metadata in files before upload, so that when the files arrive in DSpace they will have the correct metadata.
- The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so, as they are both brainchildren of the geeks at Adobe.) However, what is needed in this particular setup is a two-way street with a check in between: we would need to overwrite what was there with the data we want there.
- However, as of 2009, there were no tools in AIR which could manipulate EXIF data (for photos).
- But it does look like the situation is more hopeful for working with audio metadata.
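Whatever AIR itself can or cannot do, a cross-format fallback is to shell out to exiftool, which reads and writes PDF, EXIF, and ID3 metadata alike. A sketch that only builds the commands rather than running them; the filenames and tag values are illustrative.

```python
def exiftool_read(path, tags=("Author", "Title")):
    """Build an exiftool command that prints the given tags."""
    return ["exiftool", *[f"-{t}" for t in tags], path]

def exiftool_write(path, **tags):
    """Build an exiftool command that overwrites the given tags in
    place; -overwrite_original skips exiftool's backup copy."""
    return ["exiftool", "-overwrite_original",
            *[f"-{k}={v}" for k, v in tags.items()], path]

# Pass either list to subprocess.run(cmd, check=True) to execute.
cmd = exiftool_write("paper.pdf", Author="Jane Fieldworker")
```

Building the argument list separately from running it keeps the two-way street testable: the same tag dictionary can be compared against what `exiftool_read` reports before anything is overwritten.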
Three Lingering Thoughts
- Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, that does not mean that submitters would know how, or why, to use it. This is not, however, a valid reason to leave the functionality out of a development project. All marketing aside, an archive does have a responsibility to consumers of its digital content that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers appreciate, and even expect, that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, Photoshop, iTunes, Mendeley, Papers, etc.
- Another thought that comes to mind: when one is dealing with large files (over 1 GB), there is a reason to make a “preview” version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 for rapid assessment, to see if it is worth downloading the .wav? It seems that a metadata packager could create such a presentation file on the fly too. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
- This problem has been addressed in the open-source library science world. In fact, a nice piece of software does exist: the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs, but it is a solution for some needs of this type.
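The preview idea above can be sketched the same way as the metadata commands: build the derivative-generation commands for the real tools (ffmpeg for audio, ImageMagick for images) without running them here. Filenames, the bitrate, and the width are illustrative defaults, not recommendations.

```python
def preview_audio_cmd(src, dst, bitrate="128k"):
    """Build an ffmpeg command for a small MP3 preview of a WAV;
    -b:a sets the audio bitrate of the output."""
    return ["ffmpeg", "-i", src, "-b:a", bitrate, dst]

def preview_image_cmd(src, dst, width=800):
    """Build an ImageMagick `convert` command for a down-scaled
    preview image."""
    return ["convert", src, "-resize", str(width), dst]

# Pass either list to subprocess.run(cmd, check=True) to execute.
audio_cmd = preview_audio_cmd("interview.wav", "interview-preview.mp3")
image_cmd = preview_image_cmd("page.tif", "page-preview.jpg")
```

A packager could run these at submission time and attach the small derivative alongside the archival master, so consumers can assess a file before committing to the full download.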