As linguistics and language documentation interface with the digital humanities, there has been considerable effort to time-align texts with audio/video materials. At one level this is rather trivial to do and has the backing of commercial media processes such as film subtitling. However, at another level, each project (in digital corpus curation) tends to do this task in XML slightly differently. At the macro-scale the argument is that if the annotation of the audio is in XML and someone wants to do something else with it, they can simply convert the XML to whatever schema they desire. This is true.
However, one anecdotal point that I have not heard in discussions of time-aligned texts is the need to specify Audio Dominant Text vs. Text Dominant Audio. This may not initially seem very important, so let me explain what I mean.
Audio Dominant Text
Audio dominant texts are those which were first captured orally. They are a performance. The content may be prepared (in the sense of the telling of a traditional epic), but they are not read texts. They are given from memory. The text lives. In time-aligned transcription, the written text is transcribed from this performance. This oral text type stands in contrast to Text Dominant Audio.
Text Dominant Audio
Text dominant audio is where a text is read, scripted, or prompted. Several examples of this exist: a radio drama, a cartoon/animated film, a reading. In linguistics, a wordlist would also qualify. The guiding principle is: what is guiding the audio? Is it the structure of the text-based document, or is it the image of the oral segment as it comes to the speaker spontaneously, without visual or interactive reliance?
This difference should be noted in the metadata during oral text collection in language documentation projects. In one project I was involved in, several speakers were recorded reading texts which they had composed, written, and typed. Some editorial help was provided by the PI on some of the texts. These texts now exist as both oral and written texts (in terms of modality). But is the written form a transcription? I doubt it. It is more like the inverse of a transcription. Still, these texts can be presented online similarly to Nick Thieberger’s presentation of text through EOPAS. Example: http://www.eopas.org/transcripts/65.
I guess in terms of XML, my outstanding question is: how should the structure be represented? If the reading is only a reading, then should the text be the head of the XML document, with the audio matched to it? Obviously some would say that even readings are performances, and I do agree to an extent. However, even with these performances, because of the non-spontaneous nature of the delivery and the previous editing of the text, is there a smoothing out of the text which might be equivalent to removing dialectal syntactic variation?
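One way to make the distinction concrete is to let the dominant modality supply the parent axis of the XML document. The sketch below is purely hypothetical; the element and attribute names are my own invention and do not belong to any established schema (EOPAS, ELAN, TEI, or otherwise).

```xml
<!-- Text Dominant Audio: the written text is the head of the
     document; audio spans are attached to textual units. -->
<text type="text-dominant">
  <sentence id="s1">
    <orth>The sentence as the author composed it.</orth>
    <media src="reading.wav" start="0.00" end="2.45"/>
  </sentence>
</text>
```

```xml
<!-- Audio Dominant Text: the recording's timeline is the head;
     the transcription is attached to intervals of the audio. -->
<recording type="audio-dominant" src="performance.wav">
  <interval start="0.00" end="2.45">
    <transcription>What the speaker actually said.</transcription>
  </interval>
</recording>
```

In the first sketch, deleting the audio leaves a complete written text; in the second, deleting the transcription leaves a complete (if unannotated) recording. That asymmetry is one possible formal expression of which modality is dominant.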