Funding language documentation 

Just a quick thought.

Perception based loosely on facts:

A lot of language documentation money gets pushed towards endangered languages or languages with very few speakers. It is often awarded to an aspiring academic who promises to create a grammar for a previously unwritten or undescribed language.

Sometimes I have the opportunity to read grammars. When I read them, I have questions about how the described data sounds, both in context and as elicited. To that end, I wonder whether the money would be better spent, for language documentation and for the benefit of the academy, if organizations funding language documentation research instead funded the collection of audio and video texts of data already described in grammars. In a way, this would provide the support that modern grammars should have.

That is, I find that the state of grammars (often grammars of African languages) is frequently so fraught with errors, or so colored by theoretical disposition, that it would be immensely helpful if these grammars were supported with audio texts. The focus on small, often dying, languages, with funding conditional on "adequate" endangerment, seems to show a predisposition towards collecting specimens of some exotic language. While the collection of rare specimens is good in some sense, it is not always the most beneficial for the language speakers, nor is it really the most helpful for academic pursuits.

An Awesome list for Open Source Software

The world is full of problems. Some of these can be solved through the use of technology. Others can't be solved directly through technology, but deploying technologies can change social environments and social interactions in ways that allow those problems to be addressed.

Low-resource languages suffer from a problem of the second type. That is, there is a sociological problem, which can be summarized as follows:

Feel-good, do-good linguists often want to help "low-resourced" language communities have digital tools in their language. This scenario, in which people are "helping" from the outside, is mirrored by a contrasting one: people from within the low-resource language community want to create tools for using their language - often in written form - in digital contexts. The result is that there is often a set of persons who are project managers, or who hold the purse strings (they have access to grant funding and are responsible for contracting with technologists to implement the project ideas and goals). The problem is that the more project managers there are (and there might be more than one per language, with over 7,000 languages), the more divergent the technological solutions become. These solutions are expensive and often not compatible or extensible - even if they are "open sourced".

The problem we are trying to solve is the communication problem between the plethora of coders, who range from cowboy soloists to dedicated shops working on language software targeted for use in low-resourced communities, and the growing number of visionary project managers, who might have a background in linguistics but often do not have one in information technology or in information technology project management.

The benefits of collaboration and of code reusability are obvious. However, there still stands a large gap between the project manager and the coding technologist. We find that this gap can be characterized by two critical problems:

  1. What are the things which have been coded - and for what purpose were they coded?
  2. Assuming that this data can be gathered, how can it quickly be made usable, so that project managers can use the information intelligently in their evolving relationships with their technical teams?

In summary, we must create a pile of data, and then we need to make it usable.

In the spirit of taking baby steps, we have started to amass a pile of data (asking the question - what do we know has been coded?). We have started with a solution which is more native to coders than to project managers: we have used an element of GitHub culture - 'the Awesome list'.

While this does make a browsable list, it does not address the myriad points of view from which project managers come. Synthesizing the data to match those various points of view, and so making the data relevant and usable, is still an open task.

Lexical Data Management helps (with SIL software)

This is a quick note to record some of the things I have learned this week about working with lexical data within SIL's software options.

  1. There is information scattered all over the place:
  2. What should the purpose of the websites be? To distribute the product, or to build community around the product's existence?

Software Needs for a Language Documentation Project

In this post I take a look at some of the software needs of a language documentation team. One of my ongoing concerns with linguistic software development teams (like SIL International's Palaso or LSDev, or MPI's archive software group, or a host of other niche software products adapted from mainstream open-source projects) is the approach they take in communicating how to use the various elements of their software together to create useful workflows for linguists participating in field research on minority languages. Many of these teams do not recognize that potential users coming to their website want to be oriented to how these software solutions work together to solve specific problems in the language documentation problem space. Now, it is true that every language documentation program is different and will have different goals and outputs, but many of these goals are the same across projects. New users of software want to know the top-level organizational assumptions made by the software developers. That is, they want to evaluate how the software will work in a given scenario (problem space), and to make informed decisions based on the ecosystem that the software will lead them into. This is not unlike users asking which is better, Android or iPhone, and then considering not just what works with a given device but also where they will buy their music and their digital books, and how they will get those digital assets to a new device when the phone they are about to buy no longer serves them. These digital consequences are not in the mind of every consumer... but they are nonetheless real consequences.

Audio Dominant Texts and Text Dominant Audio

As linguistics and language documentation interface with digital humanities, there has been a lot of effort to time-align texts and audio/video materials. At one level this is rather trivial to do and has the backing of commercial media processes like subtitles in movies. However, at another level, this task is often done in XML, and done slightly differently for every project (digital corpus curation). At the macro-scale, the argument is that if the annotation of the audio is in XML and someone wants to do something else with it, then they can just convert the XML to whatever schema they desire. This is true.
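
As a rough illustration of that conversion argument, here is a minimal Python sketch. It assumes a made-up XML layout (an `annotations` root with `segment` elements carrying `start` and `end` attributes in seconds). That layout, the sample data, and the function names are illustrative assumptions, not the schema of any particular tool. The sketch converts the time-aligned segments to SubRip (SRT) subtitle blocks.

```python
import xml.etree.ElementTree as ET

# Hypothetical time-aligned annotation file; element and attribute
# names are invented for this example.
SAMPLE = """<annotations>
  <segment start="0.0" end="2.5">First utterance</segment>
  <segment start="2.5" end="5.25">Second utterance</segment>
</annotations>"""


def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def xml_to_srt(xml_text: str) -> str:
    """Convert the hypothetical XML above into SRT subtitle text."""
    root = ET.fromstring(xml_text)
    blocks = []
    for index, seg in enumerate(root.iter("segment"), start=1):
        start = to_timestamp(float(seg.get("start")))
        end = to_timestamp(float(seg.get("end")))
        text = (seg.text or "").strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    # SRT blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(xml_to_srt(SAMPLE))
```

A conversion like this only carries over what both formats can express: here, a start time, an end time, and a line of text per segment.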

However, one anecdotal point that I have not heard in discussions of time-aligned texts is a specification for Audio Dominant Text vs. Text Dominant Audio. This may not initially seem very important, so let me explain what I mean.


I have been working on describing the FLEx software ecosystem (for both a blog post and an infographic). In the process I googled "language documentation" workflow and was promptly directed to an aggregated collection of resources created for InField. It is an amazing set of resources. The website is well put together and the content from InField 2010 and 2008 is amazing - I wish I could have been there. I am almost convinced that most SIL staff pursuing linguistic fieldwork should just go to InField... But it is true that InField seems to be targeted at someone who has had more than one semester of linguistics training.

Language and Culture Documentation vs. Cultural Digital Natives

I feel that in the language and culture documentation community there is a tension between “documenting” and “globalizing”, in the sense that what we as digital natives and cultural technologists think of as “living” is in part “documenting”.

Now, in some sense “Language Documentation” is an academic pursuit in its own right, independent of linguistics, if it has a plan and tries to capture elements of the expression of the culture and language as it is spoken or acted out. I think there is a bit of confusion in the literature as linguists move from linguistics to language development and community development. This is particularly evident with the use of video in language documentation.

The Look of Language Archive Websites

This is the start of a look across language archives at the current state of UX design for presenting content generated in language documentation.