Software Needs for a Language Documentation Project

In this post I take a look at some of the software needs of a language documentation team. One of my ongoing concerns with linguistic software development teams (like SIL International's Palaso or LSDev, or MPI's archive software group, or a host of other niche software products adapted from mainstream open-source projects) is the approach they take in communicating how to use the various elements of their software together to create useful workflows for linguists participating in field research on minority languages. Many of these teams do not assume that potential users coming to their website want to be oriented to how these software solutions work together to solve specific problems in the language documentation problem space.

Now, it is true that every language documentation program is different and will have different goals and outputs, but many of these goals are the same across projects. New users of software want to know the top level organizational assumptions made by the software developers. That is, they want to evaluate how the software will work in a given scenario (problem space) and to make informed decisions based on the eco-system that the software will lead them into. This is not too unlike users asking which is better, Android or iPhone, and then deciding not just what works with a given device but where they will buy their music and their digital books, and how they will get those digital assets to a new device when the phone they are about to buy no longer serves them. These digital consequences are not in the mind of every consumer... but they are nonetheless real consequences.

Each project will have a variety of data it is interested in using or developing or manipulating. But in the large scope the kinds of data being collected will be the same. The most typical data types are outlined in the image below.

The Data Management Space for linguists with SIL software.

The Data Management Space for linguists with SIL software's target data areas added.

However, it is helpful for linguists and language documenters who are not command-line users or shell script hackers to have applications which either do specific tasks, act as plugins to the software the researchers already use, or convert data from one format to another.

This post is an attempt to outline some of the tools in the FLEx eco-system (the only one I have some sort of handle on).

Local network collaboration is needed at the digital level. It is probably safe to assume that more than one language is involved (the language being documented and the language in which the analysis is written) and that more than one person is on the research team (perhaps it is a multi-disciplinary research team, but minimally one should assume that indigenous speakers will participate in the documentation of their language at some level too). The implication is that data collected in the pre-field stage of the project and data collected in the field stage both need to be accessible to all members of the team. This does add to the complexity of local (on-site) data handling. However, this should not be viewed as a limitation, but rather as a guiding principle in software development. In addition to the local network (and possibly wide area network) design criteria, there are two other criteria which should be embedded in all linguistic software, but which as of 2013 are lacking in many applications. These two other criteria are:

  1. Cross-operating system compatibility - that the software will work on the latest iterations of Windows, Mac OS X, and Linux (and possibly if appropriate, iOS and Android)
  2. User Experience centered design - most software coded for use in linguistic environments focuses on users who have command line fluency, training in a certain linguistic theoretical framework, or a coder's mentality. Almost no software has a user experience and a user interface with a level of refinement which puts the new user at ease with the software: something with the level of refinement of, say, iTunes, where an uninformed user can start to intuit and use the software immediately. (iTunes does have several layers of complexity and several advanced features. We can learn from iTunes that not all software needs to have every feature presented in a manner which is immediately intuitable so as to eliminate any learning curve for every user, but rather that visual ugliness in the presentation of software can make the software less palatable to users, giving them the impression that the software is unnecessarily complex. [1] Harald Felgner. 5 December 2007. Aesthetics as a Business Requirement. Presentation: [Link])

Suppose we start to answer the question "What software does the language documentation team need?" with the assumption that the project is a collaboration between the researcher(s) and the community, and that the researchers can bring something that the community could not get on their own. We then do well to assume that the researcher would start at the library, doing some searches on relevant issues and on previous documentation conducted in the area and language group. As researchers we might look for publications, theses or dissertations, datasets, and corpora on: social issues, anthropology, government, history, laws, topography, climate, bio-diversity, scripts, orthographies used, previous linguistic research, and previous analyses put forward by that research. We might also want to know when all of that research was conducted, where the data is stored (can we, the researchers, see it or get a copy?), and who the players (researchers/funders) were.

Previous Research Publications and Citation Resource Management

Zotero seems to be a natural place to start to organize the citations and their related PDF resources. Now, the challenge comes: how do we share content collected in the pre-field stage when researchers are on two separate continents? And then how do we continue the collaboration when the researchers are on some South Pacific island without a stable high-speed internet connection? Zotero has the ability to use WebDAV syncing, so a local network can be set up in the pre-field portion and then the small server carried to the field with the researchers (of course the small server, like a Raspberry Pi or a Mac Mini Server, should have a back-up/redundancy solution too). Those who are familiar with Zotero will know that it started off as a Firefox browser plugin, but now it is both a plugin and/or a standalone application. It works cross-platform. The one thing about previous research which it does not inherently track is funders and research projects. It would also be difficult to manage large amounts of audio-visual data with Zotero (like data from a previous language documentation project). (While checking out Zotero, check out some of the plugins designed for it, like Zotfile, which adds some really functional features like file renaming based on metadata.)

There are other applications out there in the same citation management problem space, but none have the cross-platform consistency or price point of Zotero. Some citation managers are really popular or cool but less recommendable in these contexts. I think it is important to acknowledge that any human when given a choice is always going to choose the tool which is most familiar to them - even if it is using a hammer claw to till the garden. So, as a side note to that effect, one of the things I really like about Papers is its drag and drop features. It makes importing many small .ris files easy. Endnote is more cumbersome in this respect: as of Endnote X6 one can only import one .ris file at a time, and it must be done through a menu which does not have a shortcut key command. However, when dealing with .ris files it is possible to concatenate them with a command line tool like cat and then import the resulting file with one swift action from the menu. There are many tutorials on how to do this operation on various operating systems; here is a great tutorial on how to do this on OS X.
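
For those who would rather not use cat, the same concatenation of .ris files can be sketched in a few lines of Python. This is only a minimal sketch: the function names are my own invention, and it assumes the files are UTF-8 text in which each record closes with the usual "ER  -" end-of-record line.

```python
from pathlib import Path


def merge_ris_texts(texts):
    """Join the contents of several RIS files into one string.

    Each chunk is forced to end with exactly one newline so that the last
    record of one file cannot run into the first tag of the next file.
    """
    cleaned = []
    for text in texts:
        # Drop a leading byte-order mark and surrounding blank lines only;
        # the trailing space of a final "ER  - " line is left untouched.
        text = text.lstrip("\ufeff").strip("\r\n")
        if text:
            cleaned.append(text + "\n")
    return "".join(cleaned)


def count_records(ris_text):
    """Count records by their end-of-record lines (lines starting 'ER  -')."""
    return sum(1 for line in ris_text.splitlines() if line.startswith("ER  -"))


def merge_ris_files(paths, output_path):
    """Merge the files at `paths` into `output_path`; return the record count."""
    merged = merge_ris_texts(Path(p).read_text(encoding="utf-8") for p in paths)
    Path(output_path).write_text(merged, encoding="utf-8")
    return count_records(merged)
```

Calling merge_ris_files on a folder's worth of .ris paths gives one file to import from the menu, and the returned count is a quick check that no record was lost along the way.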

If we look at the positioning of Zotero in the FLEx eco-system, it really is not tightly integrated. FLEx is a lexical data set manager. But where it could be more tightly integrated is in the ingest and use of citations. For example, it is not uncommon for researchers to use small bits of language data when arguing for a theoretical point in some publication, or for old wordlists to be found in archives as manuscripts. These manuscripts and papers are often converted to PDFs, and researchers manage them as PDFs. However, if this language data is to be extracted from these PDFs and put in the lexicon database, then the database needs to cite where each item came from. These lexemes are usually of interest to the researcher in the field, who will then try to verify them. So, being able to import citations into the lexicon manager is important to the researcher. (Also of importance is the importing of data from databases or spreadsheets. This use case will be covered in a different section.) FLEx as of version 7.2.3 cannot do this. Now, there are several citation transmission formats; the most popular is probably .RIS, followed by Endnote XML (an XML-ized version of .ris), BibTeX, and then unique custom XML formats. Importing citations into the lexicon is not the only necessary use of citations by the researcher. Citations are also a foundational part of resources authored by the language documentation team.
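
To make the idea of ingesting citations a little more concrete, here is a minimal sketch of reading .ris text into plain Python dictionaries. It is not anything FLEx or Zotero provides; the function names are hypothetical, continuation lines are simply skipped, and the sample record in the usage note is built from the Reiman (2010) reference cited later in this post.

```python
import re

# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")


def parse_ris(text):
    """Return one dict per record, mapping each tag to a list of its values."""
    records = []
    current = {}
    for raw in text.splitlines():
        match = RIS_LINE.match(raw.rstrip())
        if match is None:
            continue  # blank and continuation lines are ignored in this sketch
        tag, value = match.group(1), match.group(2) or ""
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    return records


def short_source(record):
    """Build an 'Author Year: pages' string that a lexicon entry could store."""
    author = record.get("AU", ["?"])[0].split(",")[0]
    year = record.get("PY", ["n.d."])[0][:4]
    pages = "-".join(record.get(tag, [""])[0] for tag in ("SP", "EP")).strip("-")
    return f"{author} {year}" + (f": {pages}" if pages else "")
```

Given a record with AU "Reiman, D. Will", PY "2010", SP "254" and EP "268", short_source returns "Reiman 2010: 254-268". Keeping the tags separate (rather than one free-text string) is what lets page numbers, names, and publisher be handled as distinct values.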

Previous Research Data

As researchers cull through previous publications, and perhaps procure data sets from colleagues (like ShoeBox/ToolBox files or Excel spreadsheets), it is important to transform those into usable data. For the FLEx user that means importing them into FLEx. There are several tools to help with this. Built into FLEx is an SFM importer which can import from ShoeBox/ToolBox. Some SFM elements are automatically connected to the appropriate fields in FLEx; for custom SFM elements this will mean some extra work to find the most suitable fields (consult the FLEx website for details). For importing data from a spreadsheet the process is a bit more involved but doable. SheetSwipper, a utility also from SIL, will take data from an .xls file and turn it into a file with SFM elements. This file can then be imported to FLEx with FLEx's SFM importer.

One of the challenges with FLEx is being able to track the source of lexical elements which have been added to FLEx. I first posted about this in my review from teaching at a FLEx workshop in Malaysia in 2011 (last paragraph or so). It is not that an elaborate system of references can't be conceived and used in the simple field provided by FLEx, but that the field is not really agile or flexible in the right ways to handle citations in ways that would respect the different values contained in a citation, e.g. page numbers, names, publisher, license of the work.
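
The spreadsheet-to-SFM step can be illustrated with a short sketch. This is not how SheetSwipper works internally (I have not seen its code); it assumes the spreadsheet has been saved as CSV, that the column names are the hypothetical ones shown, and that the markers follow the common lx/ps/ge/so convention as I understand it. Carrying a source column along is one simple way to keep track of where each lexeme came from.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column names to SFM markers.
DEFAULT_MARKERS = {"lexeme": "lx", "pos": "ps", "gloss": "ge", "source": "so"}


def rows_to_sfm(rows, markers=DEFAULT_MARKERS):
    """Turn dict-like rows into SFM text, one blank-line-separated record per row."""
    records = []
    for row in rows:
        if not (row.get("lexeme") or "").strip():
            continue  # a record without a lexeme has nothing to hang fields on
        lines = []
        for column, marker in markers.items():
            value = (row.get(column) or "").strip()
            if value:  # empty cells produce no marker line
                lines.append(f"\\{marker} {value}")
        records.append("\n".join(lines))
    return "\n\n".join(records) + "\n"


def csv_to_sfm(csv_text, markers=DEFAULT_MARKERS):
    """Read CSV text with a header row and return the equivalent SFM text."""
    return rows_to_sfm(csv.DictReader(io.StringIO(csv_text)), markers)
```

A custom marker can be swapped in through the markers argument, at the cost of the extra field-mapping work in the FLEx importer mentioned above.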


Usually the next phase in the research process is writing the grant, or procuring funding. The technical aspects of producing the grant application (a written document) are very close to those of analysis and the write-up of data during or after the on-location field phase.

Many linguists like LaTeX as a publishing system, because it allows some semantic mark-up for styling and it allows the linguist to make universal changes via style sheets. SIL offers XLingPaper, a plugin to the XMLmind XML Editor application for composing papers in structured XML. This system, like LaTeX, allows for the semantic codification of documents. Of course the next level of questioning is: are the semantics oriented around linguistic knowledge or around the publishing domain? We must leave this question unanswered for the moment because it is up to the individual author to apply the power of XML to their uses. The benefit of XML over LaTeX is that the text based manuscript can rapidly be output with various stylesheets or converted to other structured knowledge forms. Another huge advantage of using XLingPaper over LaTeX in the FLEx eco-system is that the copying of interlinear text from FLEx to XLingPaper and the auto generation of draft grammars are possible based on data collected and stored in FLEx.

The drawback of both of these systems is their steep learning curve. The difficulty for users is not primarily the Graphical User Interface (GUI) difference in using new authoring tools, but rather a difference in how a text based expression is formed. Most computer users, Ph.D. and M.A. linguistics students included, are not new to computers. Most of these users' computer based authorship for the last 10 years (assuming that they typed their first paper in 6th grade) has been based in Microsoft Word. Microsoft Word, while powerful and flexible, does have an impact on the cognitive point of reference for authors. So, while the new GUI will be a challenge, including having an impact on the normative creative process of authorship, the greater part of the challenge for the new LaTeX and XLingPaper user is relearning how to structure their documents and mark semantics. (And all this while not having immediate visual confirmation that they positioned the text on the page in the correct meaningful way.)

Authorship with Citations

Both LaTeX and XLingPaper can and do handle (allow the import of) citations as they might be stored in Zotero. In XLingPaper the process for importing a citation is not exactly beautiful, in the sense that OS X's Pages and Endnote have created a logical framework for adding citations as they are needed by the author.

Authorship as conducted by minority language speaking members of the project

In many projects, minority language speakers become authors or contributors of written texts, books, readers, etc. These participants often, but not always, have less computer and technical exposure than the Primary Investigator (PI) of a language documentation project. The training of local authors is a very important part of sustaining language vitality, and can deepen the good working relationships between the PI and the community. However, neither LaTeX nor XLingPaper is the sort of tool that is easy to learn. Software suites like OpenOffice and LibreOffice are much better suited to these sorts of authors and tasks. They are also cross-platform, open source, unencumbered by internationally restrictive licenses (as many Microsoft products are), and do not encourage the piracy of licensed software. To assist the users of OpenOffice products who want to import sections of interlinear texts from FLEx, SIL has produced a plugin for OpenOffice called OpenOffice Linguistic Tools.

Collaborative Authorship

While we are considering indigenous speaker collaboration with the PI, it would be pertinent to also consider how the PI may want to provide templated documents (like a draft of the dictionary) to indigenous speakers for review. This is possible through an SIL product called Pathway. Pathway relies on formatting specifications set inside of FLEx and will then export content selected in FLEx to .ODT (OpenOffice), Adobe InDesign, e-pub, or some Java based files for use on Java based cellphones. Of course the OpenOffice files can be altered in OpenOffice before printing. However, care should be taken, and it should be explained to indigenous authors, that changes in the ODT version will not result in changes in the FLEx database, or in the next output from FLEx. Changes need to be manually entered back into the FLEx managed data set (database).

Digital Distribution and Formal Publishing

Since many of the tools for printing locally are used by the research group in workflows on-location, it makes sense to discuss here some options which are available but are likely to be used more intentionally by language documentation collaborators closer to the end of the project, as participants look to create products. As was previously mentioned, Pathway can be used to generate e-pub files. These files can then be used on a variety of devices. Importing these files into a tool like Calibre will allow the creator to convert the e-pub files to formats targeted for specific devices. Calibre will also allow the creator to embed appropriate metadata if these digital outputs are to be submitted to an archive. (Jutoh is another cross-platform app which might be of interest to the digital publisher. It allows for digital books to be created and to be submitted to multiple commercial distributors.)

One aspect of the e-pub and e-book eco-system which is worthy of note is how content on consumer devices is or is not kept up-to-date. In closed market systems of distribution like Apple's iTunes book section and Google Play's book section, when a new version of a pre-existing digital object is available it replaces the older object on the device. This can be very useful if the project is publishing drafts and has a quick release cycle for new drafts. However, if the distribution model is manual then several versions of a draft may be downloaded and retained on devices. This may be good for comparative work looking at changes between versions, but may also be bad for communities as they might struggle with competing versions of the same digital object (content). Confusion may not be problematic for the project directly, but there may be long term impacts on the way that community members perceive digital objects. That is to say, if your project proposes to have a digital object distribution, then also consider how versions are going to be managed and marked and what the impact will be on first time digital object users. Will digital versions confuse users or make the community distrust that they have the most current version?

Another tool which fits (albeit like a disjointed arm) into the publishing end of the FLEx eco-system is LexiquePro. LexiquePro is an application which allows its users to produce a CD-ROM version of a dictionary, an HTML file based dictionary output, along with several other options. LexiquePro was originally designed to "pre-flight" or "prep" data from ToolBox (another SIL application) and to create consumer oriented products. However, as the Lexique project evolved it soon boasted a dictionary editing user interface which was preferable (to some users, including linguists) over using Toolbox. This means that those linguists' work is usually still wrapped in the LexiquePro environment as these changes are not synced back to the Toolbox (or FLEx) data set. From what I can determine (as of 2013), most current LexiquePro users find it useful for creating online versions of their lexical data (or simply use LexiquePro instead of ToolBox or FLEx).

The ability to present content online is often the benchmark test for modern data management. If a tool does not get the data on-line then many users may ask: Why am I using this tool? However, there are a lot of options and opinions which come with putting various kinds of data online. Most of the architecture questions and the mechanics of how things are put on the web rapidly are forgotten once users start looking at the website and learning how it "feels" or "behaves".
So, while many linguists simply want to "put their data online" they also want the "online representation of their data" to look nice and work well across browsers. Some even want people to look at their data and provide corrections or critique - this implies interaction with the data. For this kind of interaction a Content Management System (CMS) is best suited for managing the data and the interactions between users and publishers. This CMS option can be contrasted with LexiquePro's options which are HTML file based. SIL also has an option for the CMS web based viewing of lexical data. It is called Webonary. Webonary is a WordPress plugin which ingests the XML output of FLEx and imports to WordPress. WordPress is a highly recommended CMS and is the most popular web CMS for new websites as of 2012. While Webonary does ingest the lexical data into a database, it falls short of creating a collaborative environment for a community to create their own dictionary and does not link to sources or resources which may be held in a language archive or in an institutional repository. I have reviewed Webonary in a separate post.

All of this conversation about sharing data (especially lexical data) has not gotten the language documentation team any actual data. This conversation (post) to this point has assumed that the language documenter has already input data into FLEx (so that they can share it). Now, language based data can come in several forms. Of those forms in which it might enter the project, the two most popular are oral and written (oral being sometimes captured by video and sometimes by audio). Perhaps a better way to look at this would be static data versus dynamic data (something which does not play over time versus something which has a time alignment component). In my limited experience, when I ask people about the data management plans in their language documentation projects, the predominant view is that the original or "new data" collection process starts with audio or video recordings. This leads us to the capture event - not the event being recorded and documented but rather the event of managing the recording device during its operation.

Informed Consent, Intellectual Property, Licenses & Personal Socio-linguistic Profiles

However, before diving into the capture of Dynamic Data, there are two types of data which need to be collected and which are often overlooked or undervalued in the data management plan. These two types are (1) Contracts / Informed Consent / License data about the relationship between the contributors to the project and the project implementers and (2) a socio-linguistic profile of the data contributor (language speaker).

The first type of data lets the project participants and the world know what they can do with the data and helps to establish a common understanding between the collaborators of the project. This will help determine what is appropriate data for the project to collect and how to handle it in terms of distribution, archiving, and future access. This data type, while it might originate as an oral agreement or a paper agreement (as most US based IRBs would expect by default), can be written in prose or scanned. It is more beneficial to the project (both immediately for local data management, and for the archivists later in the project's history) to codify this agreement at a per digital object level in the metadata schema.

The socio-linguistic profile will help researchers know if they are getting the right kind of data (from the right social sectors of the language community) and make appropriate connections about the data (especially in the sorting and the relating of bits of data in one part of the project to bits of data in other parts of the project). It will also help future users of the data understand where the recorded generators of the data were/are situated in the community (the speakers' environment and the social influences on that speaker). This data need not be connected directly in the metadata schema to each individual file, but rather, with appropriate technology, could be linked to using a relational database model. However, codifying this data into a schema which will allow the users of the data to sort other data files is still important - even in the data acquisition stage of the project. (More thoughts on socio-linguistic profiles.)
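
As a rough illustration of the relational model just described, the sketch below uses SQLite (via Python's standard library) to keep consent and license information on each media file and the socio-linguistic profile on a separate speaker table. The table names, column names, and the choice of profile fields are all hypothetical; a real project would need to decide what belongs in the profile.

```python
import sqlite3

# Hypothetical schema: consent and license sit on each digital object,
# while the speaker profile is stored once and linked to by id.
SCHEMA = """
CREATE TABLE speaker (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    birth_year INTEGER,
    village TEXT,
    other_languages TEXT
);
CREATE TABLE media_file (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL UNIQUE,
    speaker_id INTEGER NOT NULL REFERENCES speaker(id),
    consent TEXT NOT NULL,
    license TEXT
);
"""


def open_project_db(path=":memory:"):
    """Create the two tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def files_for_village(conn, village):
    """List (filename, consent) pairs for recordings of speakers from a village."""
    cursor = conn.execute(
        "SELECT m.filename, m.consent FROM media_file AS m "
        "JOIN speaker AS s ON s.id = m.speaker_id "
        "WHERE s.village = ? ORDER BY m.filename",
        (village,),
    )
    return cursor.fetchall()
```

The point of the join is the sorting described above: files can be grouped by any attribute of the speaker without repeating the profile in every file's metadata.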

As far as I know, there is no tool which helps the language documenter manage these kinds of relationships or data types. The only tool which comes close, and it is not necessarily in the FLEx eco-system, is SayMore. SayMore is an application (Windows OS only - not cross-platform) produced by SIL International which can do some informed consent tracking and some tracking of basic metadata about speakers (which could be considered part of the socio-linguistic profile). However, integrating this data with lexemes in FLEx or connecting audio recordings in SayMore to lexemes in FLEx is not possible at this time, and it is unclear if it will ever be possible. It should be noted that there are some transcription capabilities in SayMore and that these transcripts can be exported to ELAN. ELAN transcripts can be imported to FLEx as general texts, and then a user may use text parsing tools within FLEx to add lexemes to the lexicon. However, once the data is in FLEx there is not necessarily a linkage back to SayMore data about the speaker.

Dynamic Data Capture

The capture process for Dynamic Data as I discuss it here is considered to be the act of recording, but this has at least three aspects which need to be considered:

  • Why am I (the documenter) using this mode to capture the event? What is it I am capturing? Why record this? Why use audio? Why use video?
  • Am I doing an experiment? e.g. Is there a stimulus which I control?
  • What is the device I am using to capture the event? Devices impose different file formats on the researcher.

As we consider the software needs of the language documentation project, the first question above is important because its answer will speak to the kinds of editing software the project will need to use for the captured data. The second question is important because it speaks to the stimulus, which is another object which should be added to the collection of data the documenter is amassing (the corpus). It also implies that there should be a codification of the association between the stimulus and the result. The final question is important because different recording devices:

  • Have different firmware which might need to be updated and noted in the project logs.
  • Produce different kinds of digital object formats.

While it is agreed that WAV capture is the preferred archival format for audio, video formats are a moving target. Capture devices often have a variety of settings for the user to negotiate. Some of these settings include variations on the capture format. This can mean that documenters will need to convert the video files from the capture format to an editing format for processing and analysis. Editing applications should then provide the documenter with options for creating product specific presentation formats.

Video Formats

Working with video requires a greater understanding of the dynamics under which the final product is going to be used. It also requires understanding the technical process from capture to product - even if documenters are collaborating with video editors/producers.

The basic workflow and format variations across the Capture-Edit-Distribute spectrum are not uncommon to any of the modes of the language documenter (text, audio, video, image, geographical data), but become more obvious and important in the audio and video modes. Particular software options for each stage (including format conversion tools) are going to vary depending on the professional quality (requirements) of the editing tool and the operating system (OS) of choice.
In terms of FLEx and its eco-system, no particular choice of software for the capture, editing, and organizing of video is suggested. There is nothing specifically required by FLEx nor designed to interact with this media workflow. The eco-system is agnostic to dynamic data because FLEx only uses local links to reference dynamic data. Once the FLEx database is moved to another computer those links are not guaranteed to continue working.
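
Because local links can silently break when a project moves between computers, a quick check after each move is cheap insurance. The sketch below is generic and knows nothing about how FLEx stores its links: it assumes the link targets have already been gathered, by some other means, into a list of path strings, and the function name is my own.

```python
from pathlib import Path


def find_missing_links(link_targets, base_dir="."):
    """Return the link targets that do not point at an existing file.

    Relative targets are resolved against `base_dir` (for example the
    folder the project was copied into); absolute targets are used as-is.
    """
    base = Path(base_dir)
    missing = []
    for target in link_targets:
        path = Path(target)
        if not path.is_absolute():
            path = base / path
        if not path.is_file():
            missing.append(target)
    return missing
```

An empty result means every referenced recording was found; anything else is a list of files to track down before the team starts working on the new machine.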

In the language documentation project there are two perspectives on media. These perspectives are not always in opposition to each other, but do affect how media producers/creators work with media. The first perspective is consumer centric: the recordings are the basis of products for the community. These production goals then push documenters towards product oriented workflows for consumers (products I have encountered to date have generally had a historical, educational, or entertainment oriented focus). This perspective is of importance to language revitalization efforts and is often tied to the notion held by documenters that they should give something back to the community they work in and with. Providing video and/or audio recordings often falls within the scope of what documenters feel is feasible to do. The second perspective is that the media is data of linguistic importance and needs to be transcribed and annotated for linguistic analysis. Workflows producing media and products of concern primarily to linguists and pedagogical professionals working within the community generally focus on transcription, annotation, or translation of the recorded event. Software is needed for media handling and management from each of these two perspectives.

Transcription workflow

Transcription workflow showing movement from recording through transcription to translation to finished product.
Image from: Berez and Cox 2010. [2] Andrea Berez and Christopher Cox. 2010. Presentation at inField 2010: Aligning Text to Audio and Video Using ELAN. June 22 - July 2, University of Oregon. URL: …

If we take the workflow presented above in Berez and Cox (2010), then the first step is to look at audio (and video) editing tools. That workflow seems reasonable considering that editing should happen prior to transcription in order to keep content time aligned (but this also means that the embedding of metadata into media files should occur at the end of the editing stage, along with MD5 hashing). In terms of audio, the editor Audacity is reasonable to use: it is open source, cross-platform and currently maintained. The one hesitation I have concerns the use of Audacity for embedding metadata in audio files, and this is because I have not used it to do that yet.
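
The MD5 hashing mentioned above, done once editing is finished, can be sketched with Python's standard hashlib module. The function names are my own; the file is read in chunks so that long video files do not need to fit in memory.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def matches_recorded_hash(path, recorded_hex):
    """Check a file against a hash written down earlier in the project log."""
    return md5_of_file(path) == recorded_hex.strip().lower()
```

Recording the digest in the project log at the end of the editing stage gives the team, and later the archive, a way to confirm that a copy of the file is byte-for-byte what was transcribed.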

With respect to video, Apple's products (Final Cut Pro) are generally considered the best for editing video projects ranging from amateur videographer to Hollywood feature film. However, Apple's tools only run on OS X. Adobe's video solutions (Adobe Premiere Pro CS6) run on both OS X and Windows, but not on Linux. Both Apple and Adobe's software come with a price tag, which if the documenter has the skills to use these tools can be worth the cost to avoid hassles introduced by other tools. There are some open source, cross-platform solutions like Avidemux and VideoLAN Movie Creator (still in early development as of 2013).

As the diagram above illustrates, the step after editing is transcription. Transcription does not always imply writing (as with the BOLD method [3] Reiman, D. Will. 2010. Basic oral language documentation. Language Documentation & Conservation 4. 254-268. [PDF]), but generally does. With writing come the necessities of keyboarding and keyboard layouts. I address this more below under static data collection and more deeply in Keyboard layout as part of language documentation: the case of the Me'phaa and Chinantec keyboards ([4] Hugh Paterson III. 2012. Paper presented at: Language Endangerment: Methodologies and New …). Eventually questions about the graphical representation of the oral production are asked. This often leads to investigations of the phonetics and the phonology of the documented language, which in turn can lead the language documentation team to the collection of new genres or more targeted tokens of data. Investigating phonetic and phonological aspects of a language is not out of scope for this article but likely deserves its own section.


Illustration from the fine folks at the University of Melbourne.

While ELAN is a fantastic tool for performing the transcription task, there are several things yet to think about:

  • Other applications which can be used instead of ELAN (Why ELAN and why not use these other applications?)
  • The organization of the team around the transcription task (Who does what when?)
  • The use of ELAN transcriptions in other applications like FLEx (Ok, now it is in ELAN what can we do with it?)
  • Ok I did something with my data, now can I look at this back in ELAN? (Import back to ELAN)

To address the first question (why use ELAN instead of these other options?) we must take a look at some of the other options. TranscriberAG, which was targeted at the transcription of audio materials, hasn't been updated since 2009. The Mac version only works up to OS X 10.5 (I couldn't get it to install on OS X 10.6.8) and there are several reports that it does not work in Windows 7. Transcribe!, a cross-platform application (OS X, Windows, and Linux), is targeted at musicians and helps them determine the notes of a recording. It also has a way for the user to transcribe text and can play videos via QuickTime. But it lacks the auxiliary support of a vibrant linguistic community behind it. It is also not designed to support multiple tiers or scripting. FOLKER, from the Archive for Spoken German, does have an XML output, but users would have to be able to transform FOLKER XML into XML which is usable by their other software. Another limitation is that FOLKER is designed to facilitate only the transcription of audio, not video. The final application (in 2013) which I have found to be still in use among linguists for transcription is EXMARaLDA. (There is also Praat, Emu, Poio, and SayMore, but I already mentioned SayMore and will mention Praat later under the phonetics and phonology section. Emu seems to remain a mystery: Is it for phonetic analysis, is it for corpus management, is it for searching wav files for sound patterns? Poio is the work of the CIDLeS group and seems to replace Kura, which would allow one to open an ELAN file and then add tiers with grammatical annotations in them. The new elements of Poio seem to include some corpus management and transcription, but I am not clear on how usable, useful or applicable they are as of 2013.) EXMARaLDA seems to be part of its own eco-system, with connections to almost any other well used transcription tool (Praat, Emu, ELAN, etc.). I am not sure where it should fit in the language documentation workflow, or if it is more suited to the world of audio corpus analysis. (If a reader can give me input on how or why to use EXMARaLDA it would be much appreciated.) There look to be some nice browsing features for corpora which have used EXMARaLDA. However, strictly speaking, deciding on a tool to use based on the kind of presentations one can do with the ensuing corpus is not the best way to evaluate a tool, but is likely how most of us evaluate tools. It should also be pointed out that Nick Thieberger has a very nice looking presentation for a digital corpus on eopas.

We still haven't completely answered the first question: Which transcription software should be used by a language documentation team? Part of the answer is in logistics and in answering the second question: Who does what when? Before I digress and discuss workflow related issues, it is important to take a look at the physical arrangement of computers, backup solutions and local network layout. While not every field setting has stable power sources or security for lots of computers, when we start to talk about a language documentation team we imply collaboration. And collaboration must also happen digitally. Let me offer a network scenario based loosely on my previous experience in Mexico in 2010. Though the image below could be considered "elaborate", I think it is very realistic. Even if we were to reduce the digital collaboration to two computers, a backup and asset (large video and audio files) management solution would still be needed.

Hypothetical Language Documentation network

Hypothetical language documentation network based on experience in México.

I fully realize that there are quite a few assumptions here. But even in a minimalistic case one would expect 2-3 computers and a backup solution. Collaboration at this level also means deciding how that data will be referenced or accessed by collaborators.

Collaboration implies complexity with respect to network storage and data management. But perhaps the greatest challenge is to create an agreement among the documenters/investigators/researchers about the organization of files in the corpus. My experience in workgroups is that the perception of usability and organization is strongly based on the visual arrangement of files. This gets complicated when there are differences of opinion based on how individuals want to work with files, or when files could fit under two types of folder branches (say we had one branch for all of one speaker's files and another for a pan-lectal survey sample the same speaker provided). In some sense an abstraction layer is needed which gives individuals the ability to organize files the way they want to so they can proceed with analysis. This layer of abstraction speaks to the need for a local-network (but not necessarily only local-network) corpus management solution. (I have some open draft thoughts about an implementation and feature set for this which was started in 2011.)
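To make the idea of an abstraction layer a little more concrete, here is a minimal sketch in Python. It is not a description of any existing corpus manager. The class name, the tag format and the file names are all invented for illustration. The point is only that a file is stored once and can appear in several "views", so the speaker-based branch and the survey-based branch no longer compete.

```python
from collections import defaultdict


class CorpusIndex:
    """Hypothetical index that attaches free-form tags to file names,
    so one file can be listed under several organizational views."""

    def __init__(self):
        self._tags_by_file = defaultdict(set)

    def tag(self, filename, *tags):
        """Attach one or more tags to a file (the file itself is not moved)."""
        self._tags_by_file[filename].update(tags)

    def view(self, tag):
        """Return the sorted names of all files carrying the given tag."""
        return sorted(
            name for name, tags in self._tags_by_file.items() if tag in tags
        )


# Invented example data: one recording belongs to both branches.
index = CorpusIndex()
index.tag("rec001.wav", "speaker:A", "survey:wordlist")
index.tag("rec002.wav", "speaker:A")
```

Here `index.view("speaker:A")` lists both recordings, while `index.view("survey:wordlist")` lists only the first, without either collaborator having to win the argument about folder structure.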

Suppose that a multi-user/network based corpus management solution is not used, and that a tree/folder based system is agreed upon. A team can still collaborate. It is just harder without a corpus management solution. Even with this agreement there are other issues to consider with respect to collaboration. Collaboration may be several people contributing to a single corpus, or it might be several people enriching a single item within a corpus. Facilitating these different notions of collaboration puts constraints or requirements on software. If we stay with the transcription task, one way to approach the distribution of edits to transcription files in a collaborative fashion is to copy the media to each person's machine and then place the transcription files in a Mercurial folder, where changes to just the transcription files are passed between the team members' computers so that everyone has access to the latest changes. Then when someone makes a change to one of the files they just have to push it to the master copy under version control. Another way to approach the distribution of edits is to have a central database where contributors must log in and then edit the central source.

Yet a third way to collaborate is to divide the workflow up so that each person handles all of the media but only does specific things to it. For example, person 1 does the editing of the media, then person 2 does the phonetic transcription, and then person 3 adds the grammatical parts of speech to the POS tier.

Large LD Workflow

Language Documentation Workflow where each participant enriches a single object, adding crucial elements from their area of expertise.

This style of workflow division works well with the translation part of the task. However, with both ELAN and FLEx there is the challenge of working with team members who may become easily dismayed by multi-task oriented user interfaces.

To assist in the implementation of collaboration of this kind Adam Baker has produced two pieces of software which help the team move the transcription and translation forward. The first of these is ElanCheck.

ElanCheck is a low-tech, no-expertise-required way to edit the transcriptions in Elan (.eaf) files. Using ElanCheck, a language helper can transcribe a segmented audio file, without having to confront the frightening Elan interface.

The second of these is FreeTranslator.

FreeTranslator is a low-tech, no-expertise-required way to add free translations into FlexText files (for Language Explorer). Language helpers can use this tool to add free translations to FlexText files without having to use Language Explorer.

The designers of FLEx took a similar approach to eliminating complexities by designing WeSay. WeSay allows the language documenters to create custom fields for the collection of lexicographical entries and then import those WeSay files into FLEx. This process still requires the defining of workflows, but allows WeSay users to focus on specific elements of the lexical entry.

At this point though there is still no software in the FLEx eco-system which helps manage the lexical database enrichment process. (Manage workflows which result in enriched lexical entries.)

Data Exchanges

The notion of collaboration basically presents two options for data exchange: a centrally hosted data store which everyone works off of, like a server hosted database, or a locally managed and used data set which gets synced with collaborators during exchanges. In the FLEx eco-system there are three options for data syncing: LiftBridge, FLExBridge, and Language Depot. These services each do different things but enable the team using FLEx to collaborate on FLEx managed data. Not all the data in the language documentation project, as indicated above, can be stored in FLEx - it is not a corpus manager. Therefore there are still synchronization needs which are not met by these three options.

Other text based files like ELAN or other transcription files can more easily be synced with tools like git or mercurial.
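Part of what makes ELAN files friendly to tools like git or Mercurial is that an .eaf file is plain XML. As a rough illustration, the Python sketch below reads the time-aligned annotations out of such a file using only the standard library. The function name and the tiny sample document are my own inventions. The sample is cut down to the handful of elements the sketch needs and is not a complete, valid .eaf file.

```python
import io
import xml.etree.ElementTree as ET

# A cut-down, invented sample with only the elements used below.
SAMPLE_EAF = b"""<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf_annotations(eaf_source):
    """Return (tier, start_ms, end_ms, text) tuples for every
    time-aligned annotation in an .eaf file path or file object."""
    root = ET.parse(eaf_source).getroot()

    # Map each time slot id to its value in milliseconds (may be absent).
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((
                tier.get("TIER_ID"),
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return rows


rows = read_eaf_annotations(io.BytesIO(SAMPLE_EAF))
```

Because the format is text, a version control system can show exactly which annotation a collaborator changed, which is much harder to do with binary media files.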

There is an interesting product called Transformer. It allows the importing of various transcription formats (from Praat, ELAN, EXMARaLDA, etc.). However, it only works on Windows. But if the project requires the use of several transcription formats this may be of some use. It is not immediately clear whether Transformer is also expected to act as the corpus management solution or whether it simply handles file format changes.


Positioning of Transformer among text formats.
Image from Transformer's website.

However, the data exchange challenge is more difficult with larger files containing audio, video and still images. To my knowledge, this data is still best managed in a central location, with team members accessing a central data store.

Section in draft.

Corpus Management solution CoMa:

Edit/Workflow to archive copy for transcription

Audacity workflow

Audacity workflow for splitting audio files based on ELAN transcriptions.
Image from: Berez and Cox 2010. [5]

Annotation vs. transcription

Section in draft.


Text Analysis Markup System

Phonetics and Phonology

Section in draft.

Phonology Assistant

Snack Sound Toolkit:



Grammar analysis write up

Section in draft.

Geographical Information Data (GIS)

Interest in geographical relationships between people and language use is on the rise. Linguists, historians, anthropologists, cartographers, sociologists, and language documenters all have interests in the relationship between people and their environments (not to exclude the military, lawyers or corporations, which also have an interest in land use and wish to know who speaks what language and where they are from or where they live). In recent years dialectal variation analysis has shifted from measuring language change on a geographical basis to a social-network basis. In some ways this de-emphasizes traditional notions of value in space based plotting of data (maps). But GIS data is not to be forgotten in the language documentation project. Knowing which island in a chain (or even where on that island) a person comes from is still helpful in understanding the interactions of social-networks. Mapping speaker variation with local geographical features can also reveal interesting relationships between geography and worldview. While the effects of globalization have in general encouraged people to travel more, globalization has also brought us the handheld GPS unit, making it easier to gather GIS data. In recent years device manufacturers have standardized transmission options around USB and Bluetooth technologies. Data is often transmitted off of these devices in an XML format known as GPX. Many times device manufacturers provide utilities for transferring data into their applications. (My experience is with Garmin.) However, some of these GPX files need to be edited prior to archival (I have found no good solution for editing GPX files). Additionally, many times the data from the GPX files becomes metadata on other files. An additional challenge with relying on software provided by device manufacturers is that the icons the user sees or chooses on their handheld unit may not be the icons which show up in GPX browsers.
My experience in using GPS data in a language documentation project has led me to use GPSBabel, an open source, cross-platform utility which is deeply respected in the geocaching and GPS user communities. GPSBabel has a long history of working with a variety of GPS unit types, even from before manufacturers standardized on USB and Bluetooth. It can also transform data from one format to another (GPX to something else or something else to GPX).
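Since GPX is XML, small edits such as renaming a waypoint before archival can be scripted. The following Python sketch uses only the standard library and assumes GPX 1.1 files (the namespace shown). The function names, the sample waypoint and its coordinates are invented for illustration, and this is not a substitute for a proper GPX editor.

```python
import io
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"  # GPX 1.1 namespace
ET.register_namespace("", GPX_NS)  # keep output free of ns0: prefixes

WPT = f"{{{GPX_NS}}}wpt"
NAME = f"{{{GPX_NS}}}name"

# Invented sample data; real files come from the GPS unit.
SAMPLE_GPX = (
    b'<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" '
    b'creator="example"><wpt lat="17.05" lon="-98.75">'
    b"<name>WPT001</name></wpt></gpx>"
)


def list_waypoints(gpx_source):
    """Return (name, latitude, longitude) for each waypoint."""
    root = ET.parse(gpx_source).getroot()
    return [
        (wpt.findtext(NAME, default=""), float(wpt.get("lat")), float(wpt.get("lon")))
        for wpt in root.iter(WPT)
    ]


def rename_waypoint(gpx_source, old_name, new_name):
    """Return the GPX document as bytes with one waypoint name replaced."""
    tree = ET.parse(gpx_source)
    for wpt in tree.getroot().iter(WPT):
        name_el = wpt.find(NAME)
        if name_el is not None and name_el.text == old_name:
            name_el.text = new_name
    out = io.BytesIO()
    tree.write(out, encoding="utf-8", xml_declaration=True)
    return out.getvalue()


edited = rename_waypoint(io.BytesIO(SAMPLE_GPX), "WPT001", "Village well")
```

The same `list_waypoints` output could also be used to copy coordinates into the metadata of recordings made at those locations.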

In terms of creating maps or merging other GIS data with newly acquired data there are two open source and cross-platform GIS applications which are respected and offer a lot of options: Quantum GIS and GRASS GIS.

With respect to FLEx and GIS data, FLEx does not particularly look to handle GPS data on speakers, events, where texts were collected, or on where lexical items were heard used. This is not particularly unexpected given the traditional lexicographic view which has gone into FLEx architecture (rather than a web 2.0/3.0 data mashup architecture). However, it would be nice to see FLEx become GeoAware so that linguists, language documenters, and lexicographers, can analyze corpora, texts, and lexicographical datasets by social-network or geographical region.

Static Data Collection

Static data differs from dynamic data in that it does not have a time aligned segment to it: photos, for example, as opposed to video. There are a variety of data types which fit this category and deserve special attention: cultural observations, anthropological notes, genealogical or kinship data, texts which are produced in a written form rather than an oral form, dictionary data which doesn't come from oral texts (like lexeme associations or relationships and definitions), and photos and images.

Cultural and anthropological observations

Section in draft.

Inside FLEx

Kinship Data

Section in draft.

SILKin and KinOath


Still photography is perhaps the stepchild of the language documentation modes. It has been around for some 150 years at this point, and paintings preceded it by centuries (or longer if we count cave paintings and pottery paintings). However, it is often eschewed in favor of video or audio recordings. These dynamic data artifacts are often seen by linguists as being more valuable, and since language documentation has its roots in linguistic traditions it is natural to expect these kinds of artifacts. But language documentation, as an emerging interdisciplinary field of study and art, needs to reconsider photography and what it can learn from visual anthropology. The place of a well produced image to capture the essence of the subject, to evoke a mood, or to describe a cultural reality is not to be underestimated, nor should the connection between the photographer and the subject be ignored. There are several things which can be said and should be articulated before we consider software and photos as they relate to the language documentation project.

Image capture vs. Photography
Perhaps the first thing to consider with still photography is the nature of the image. In the early days of photography film was expensive. Then over time the cost of photos dropped and they stopped being viewed as something for special occasions, or as works of art. Then the digital camera happened, and somewhere around 2006-2007 everybody and their neighbor had a camera. The world and the internet got accustomed to seeing digital images that were low quality in terms of both pixel density and artistic composition. By 2013 we see lots of cameras (even on cellphones) taking images at 8 megapixels. This often helps with the pixel density issue but it does not necessarily help with composition, or color accuracy, or the mood of the image. Therefore it is important to look at images and say, there are photographs and then there are images... both have their place, but the difference between the two viewpoints can speak volumes without words. In language documentation, and the creation of image based materials, it does behoove us to think about how lasting the images we create are. Lasting also in the sense that we need to ask not only "do we have an image for that dictionary entry, or of that person?" but also "in times to come, how many people are going to want to see that image?" While it may not be the linguist's prerogative to create a DK quality children's photo dictionary with fascinating images, images have the power to draw minds into concepts and language.

DK Children's Dictionary Page

Page from a DK Children's Dictionary.

Image from DK's website.

Editing vs. Composing
As we think about the first difference between images and photographs, one of the next logical questions to ask is: What is the line between an image and a photograph? Is it the amount of editing? This line of questioning may lead us to a sharper definition. But in the process it raises the question of what editing is. Is editing the touchup process, is it the adjustment of colors, is it the placement of the subject into a background in which the subject never existed? As an amateur photographer who shoots RAW with a Canon T3i and uses Lightroom, I want to acknowledge that post processing has always been part of photography, since the days of darkrooms. However, even as photographers might be using art to document culture, they must be mindful of the reality they are attempting to put into a documented form. So, there will be a line, and an unedited image or photograph is not always better, because the camera does not capture everything the eye can see anyway.

The following photos and images show some of the ways that still photography can and does document culture, indigenous attitudes, and the geo-political environments in which the language and culture being documented exist.

Brad Corrigan's work in Nicaragua from the Love Light and Melody project.

images from Nicaragua

Images from the Love Light and Melody project.

Images from Nicaragua.

Images from the Love Light and Melody project.

Eric Jones' work on his website highlights some of the diverse people and scenery of the country of Azerbaijan. I particularly like this one as it shows cultural architecture, cultural color, language script, nationality, along with personal expression.

JoeyL also does some fascinating work with cameras and culture having images on his site from Indonesia, India, and Ethiopia.

I think it is important to point out that photography, even if the subjects are dressed in traditional attire and are posing for the photographer, has been a part of culture documentation for at least the last 100 years.

Displays of Pendleton woolen mill blankets with Indian man

Ed Chapman photographed by Lee Moorhouse between 1897 and 1920. Accessed via the University of Oregon Libraries.

Photo and Image Processing

Section in draft.

Composite imagery vs. hue saturation, color correction and vignetting, versus full "natural" imagery. What is documentation and what is art? What is an image vs. what does the eye see?

What is the celebration of culture? HDR?
Photomatix HDR, Lightroom,
Photoshop GIMP

Bibble Pro
Photo and Image Management
Google Picasa
Apple iPhoto

Read this article: how should the language documentation team be set up?

Keyboarding and Keyboard Creation

Section in draft.

This really deserves its own post because usually each keyboard layout must be uniquely created for each OS - this requires software. That said there is MSKLC for Windows, and Ukelele for OS X and KeyboardLayoutEditor for Linux variants. (These are certainly not all the options out there but these are the ones which respect the operating systems' built in features.) However, the real art in Keyboard Layout Design is matching the User Experience with natural patterns in the language and orthography systems.

For the FLEx user there are two possible OS environments, Windows and Linux. MSKLC keyboards do work with FLEx on Windows. Other options are unattested by me. (Input is always welcome.)

Social organization around tasks and workflows

Section in draft.

Who does what?
How is the data collected?
Is research analysis driven or is it social product driven?

Immediate public relations and web publications

Section in draft.


Corpus Management and Archive submission

MPI's Corpus tool. LAMUS:

Admin Software

Section in draft.

Remote Desktop



1 Harald Felgner. 5 December 2007. Aesthetics as a Business Requirement. Presentation: [Link]
2, 5 Andrea Berez and Christopher Cox. 2010. Presentation at inField 2010: Aligning Text to Audio and Video Using ELAN. June 22 - July 2, University of Oregon. URL: [Link]
3 Reiman, D. Will. 2010. Basic oral language documentation. Language Documentation & Conservation 4. 254-268. [PDF]
4 Hugh Paterson III. 2012. Keyboard layout as part of language documentation: the case of the Me'phaa and Chinantec keyboards. Paper presented at: Language Endangerment: Methodologies and New Challenges, CRASSH Cambridge, UK. 6 July
