Several months ago, I posted a question to Facebook about digital literacy.
What is the role or place of Digital Literacy in a company that values literacy as being vital to reaching its goals?
I have had several months to contemplate the question, and I realize that it was a bit ambiguous, or rather that it could not have been understood precisely. Digital Literacy can be, and is, used to mean Continue reading →
A document’s DOI (http://www.doi.org/ or on Wikipedia under Digital Object Identifier) is an important part of the citation of a document. Many style sheets allow for just the DOI of a paper as the citation. Because DOIs are unique, they can act as URIs which are resolvable and look like URLs. However, a DOI is different from a URL, which only records where a digital object might be located. It could well be argued that a DOI should be tracked in the metadata schemes of archives which collect language and linguistic data. Continue reading →
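As a minimal sketch of how a DOI can act as a resolvable URI, the snippet below turns a DOI into a link through the public doi.org resolver. The helper name `doi_to_url` is hypothetical, and the checks are only a rough sanity test (a DOI starts with `10.` and contains a `/`), not a full validation.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # the public DOI resolver


def doi_to_url(doi: str) -> str:
    """Return a resolvable URL for a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Accept the common "doi:" prefix as well as a bare DOI.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Rough sanity check only: DOIs begin with "10." and have a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1000/182"))
```

An archive that stores the bare DOI in its metadata can build the link on demand, so the record stays valid even if the object moves to a new location.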
I have been trying to find out the best way to present audio on the web. This led me to look at how to present video too. I do not have any conclusions on the matter, but I have been looking at HTML5 without using JavaScript or Flash. Because my platform (CMS) is WordPress, Continue reading →
Metadata is very important; everyone agrees. However, there is some discussion about how to develop metadata and how to ensure that it is accurate. Taxonomies are limited vocabularies (a set number of items) where each term has a predefined definition. A folksonomy is a vocabulary where people, usually users of data, assign their own useful words or metadata to an item. Folksonomies are like taxonomies in that both are sets, but unlike taxonomies in that a folksonomy is an open set where a taxonomy is a closed set.
An example of a taxonomy might be the colors of a traffic light: Red, Yellow, and Green. If this were a folksonomy, people might also suggest the colors Amber, Orange, Blue-Green and Blue. These additional terms may be accurate for some viewers of traffic lights, or in some cases, but they do not fit the stereotypical model of what the colors of a traffic light are.
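The closed-set versus open-set distinction can be sketched in a few lines. This is only an illustration using the traffic light example; the names `tag_with_taxonomy` and `Folksonomy` are made up for the sketch.

```python
# Closed vocabulary: only these terms are valid (a taxonomy).
TRAFFIC_LIGHT_TAXONOMY = frozenset({"red", "yellow", "green"})


def tag_with_taxonomy(term):
    """Accept a term only if it belongs to the closed set."""
    term = term.strip().lower()
    if term not in TRAFFIC_LIGHT_TAXONOMY:
        raise ValueError(f"{term!r} is not in the taxonomy")
    return term


class Folksonomy:
    """Open vocabulary: any term a user supplies is recorded."""

    def __init__(self):
        self.terms = set()

    def tag(self, term):
        term = term.strip().lower()
        self.terms.add(term)  # the set simply grows with use
        return term


folk = Folksonomy()
for color in ["Red", "Amber", "Blue-Green"]:
    folk.tag(color)
print(sorted(folk.terms))
```

The taxonomy rejects "Amber" outright, while the folksonomy keeps it. That difference is the trade-off between control and usefulness to the person doing the tagging.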
An example of a taxonomy might be the keywords on a book record in a library. A library might have only certain keywords it uses. In contrast to curated records at libraries, websites like Flickr and Delicious allow users to tag (or keyword) their photos and links with the keywords which are useful to them. These are examples of folksonomies. However, the concept of user generated metadata goes beyond the folksonomy to any and all user generated metadata. In this scope, projects like LibraryThing and Bibsonomy deserve to be mentioned as sites where user generated metadata plays a powerful part in the organizational presentation of the content on the site.
So the question becomes: how are managers of data, like webmasters or librarians, to ensure the quality of metadata while balancing that quality with the usefulness of the metadata to the users of the data? If visitors to the library cannot find the book they are looking for because the terms they are searching with are not associated with the record for the book, then the cataloguing record is not as useful to them as it could be. But if the library opens up its records for everyone to edit, then how is the library to know that the records are accurate?
In linguistics there are several important taxonomies.
In this context there is also a multilingual element: each term may have several variations across languages, e.g. Phonology in English is Phonologie in German.
And in library science there are also several important taxonomies.
And every company or institution is going to have their own special taxonomies for various purposes.
SIL International unique taxonomies
The challenge for “marketing”, or enabling the rapid and useful discovery and association of resources, is for the institution to spend as little effort as possible describing resources while allowing users to provide accurate metadata which is helpful to them. After all, their mental associations are very important to the use and discovery of relevant resources. So the question is: how can users add metadata value to objects in the archive? And how can the institution trust these proposed added-value elements? SIL International, as host institution to the Language and Culture Archive, is not alone in this problem space.
Basically what is needed is an algorithm for turning unstructured data into valuable, valued, authoritative, structured data.
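One very simple way to picture such an algorithm is a promotion threshold: a user tag becomes a candidate for the controlled vocabulary once enough different people have applied it independently. This is only a sketch under that assumption, not a method taken from any of the studies discussed here. The function name `promote_tags`, the sample data, and the threshold of three users are all hypothetical.

```python
from collections import defaultdict


def promote_tags(taggings, min_users=3):
    """Return tags applied by at least `min_users` distinct users.

    `taggings` is an iterable of (user, tag) pairs. The threshold is an
    arbitrary illustrative value, not a recommendation.
    """
    users_by_tag = defaultdict(set)
    for user, tag in taggings:
        # Normalize case and whitespace so "Phonology" and "phonology" match.
        users_by_tag[tag.strip().lower()].add(user)
    return sorted(
        tag for tag, users in users_by_tag.items() if len(users) >= min_users
    )


# Made-up sample data: three users agree on one tag, fewer on the others.
taggings = [
    ("ana", "Phonology"), ("ben", "phonology"), ("cy", "phonology"),
    ("ana", "sounds"), ("ben", "tone"), ("ana", "tone"),
]
print(promote_tags(taggings, min_users=3))  # ['phonology']
```

Counting distinct users rather than raw uses keeps one enthusiastic tagger from promoting a term alone. A curator would still need to review the candidates before they gain any authority.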
An algorithm for turning folksonomies into taxonomies
As I have stated above, SIL International is not alone in this problem space: several studies and use cases addressing this very kind of problem have been done and published.
Analysis of User Generated Metadata in the Library Thing Folksonomy, Vincent Sterken
The use of social discovery systems is rapidly expanding, often building vibrant and interactive communities. Some public and academic libraries are trying out these systems, in which patrons can contribute ratings, reviews, and comments. While user-contributed metadata may not equal the quality of professional cataloging, it can enhance the catalog records with rich supplementary information and personal perspectives. The author’s examination of use of social features in two public libraries led to the discouraging observation that addition of user-generated metadata in these contexts was limited, in sharp contrast to other social sites. The question of motivation is key. People’s notions of library catalog records and their ownership by library staff may present an obstacle to contributing metadata. User-generated metadata has the potential to add value to records while conserving limited library resources. The challenge of promoting the active use of social discovery systems in libraries demands further research.
The Continuum of Metadata Quality: Defining, Expressing, Exploiting
http://www.ecommons.cornell.edu/handle/1813/7895
Like pornography, metadata quality is difficult to define. We know it when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter. For this reason, among others, few outside the library community have written about defining metadata quality. Still less has been said about enforcing quality in ways that do not require unacceptable levels of human effort.
Metadata creation system for mobile images
http://dl.acm.org/citation.cfm?id=990064.990072
User-Generated Metadata for ETDs: Added Value for Libraries, Sharon Reeves
http://epc.ub.uu.se/etd2007/files/papers/paper-40.pdf
Making Use of User-Generated Content and Contextual Metadata Collected during Ubiquitous Learning Activities
During the last years significant research efforts have been conducted looking at how to standardize digital educational content. Due to better connectivity and computational power of mobile devices, new opportunities have emerged for collecting user-generated data based on the context and the environment where the content has been generated. While metadata standards for learning objects such as IEEE LOM make it possible to annotate digital content with pre-defined metadata tags, the ability to store custom user-generated or contextual metadata is not yet fully supported. The need for developing a flexible solution to deal with these problems motivated the design of our activity controller system (ACS), a rapid prototyping system and a task manager, which interprets, reacts to and stores contextual metadata and content extracted during learning activities. This paper presents how ACS facilitates coordination and reusability of user generated data, which we believe is as a valuable feature compared with existing standards and initiatives.
Annotea and Semantic Web Supported Collaboration
http://ceur-ws.org/Vol-137/01_koivunen_final.pdf
Mapping Entry Vocabulary to Unfamiliar Metadata Vocabularies.
Author-generated Dublin Core Metadata for Web Resources: A Baseline Study in an Organization
Jane Greenberg, Maria Cristina Pattuelli, Bijan Parsia and W. Davenport Robertson. Author-generated Dublin Core Metadata for Web Resources: A Baseline Study in an Organization. http://journals.tdl.org/jodi/article/viewArticle/42/45
This paper reports on a study that examined the ability of resource authors to create acceptable metadata in an organizational setting. The results indicate that authors can create good quality metadata when working with the Dublin Core, and in some cases they may be able to create metadata that is of better quality than a metadata professional can produce. This research suggests that authors think metadata is valuable for resource discovery, that it should be created for Web resources, and that they, as authors, should be involved in metadata production for their works. The study also indicates that a simple Web form, with textual guidance and selective use of features (e.g. pop-up windows, drop-down menus, etc.) can assist authors in generating good quality metadata.
A couple of years ago I had a chance meeting with a cartographer in North Dakota. It was interesting because he asked us (a group of linguists): “What is a language or linguistic map?” So, I grabbed a few examples and put them into a brief for him. This past January at the LSA meeting in Portland, Oregon, I had several interesting conversations with the folks at the LL-Map Project under the LINGUIST List. It occurred to me that such a presentation of various kinds of language maps might be useful to a larger audience. So this will be a bit unpolished but should show a wide selection of language and linguistic based maps, and in the last section I will also talk a bit about interactive maps. Continue reading →
I have been looking at different ways to make SIL’s digital research content more interactive, findable, and usable. Today I found http://research.microsoft.com/en-us/. It is interesting how they approach the facets of Location, Projects, Publications, and People up in the right-hand corner. I think they did a good job. The site feels like it is balanced.
For the last few weeks I have been thinking about how one can measure the impact on a language of its community's contact with other languages. I have been looking for ways that remoteness has been measured in the past. I recently ran across a note on my iPhone from when I was in Mexico, dated March 8, 2011.
A metric for measuring the language shift, contact, and relatedness of indigenous languages of Mexico
The formation of areal features
Population density
Trade and social networks
Political affiliation
Geographic factors
Roads and travel opportunities
I remember writing this note: I was standing in front of a topographical map showing terrain regions. This map also had the language areas of Mexico outlined. It occurred to me (having also recently had a conversation with a local anthropologist on the matter of trade routes and mountain passes) that these sorts of factors should be accounted for as factors in language endangerment, and that if they can be accounted for then they should also be able to be graphed (on a map, of course). The major issue is that if one just plots a language area without showing population or speaker density in that area, then the viewer of that map will get a warped view of the language situation. Population density alone also does not tell us where language attrition is unlikely to occur. And language contact does not automatically happen on the edges of a language area. That is to say, in a country with mountain passes, there will likely be more language contact in the passes, as various groups travel to market, than in mountain villages at higher elevations. This leads to the issue of language diffusion and its representation. But the issue is not just one of language diffusion; it is also one of population diffusion, population mobility, and accessibility to various areas. So in terms of projecting, assessing, and plotting language vitality, remoteness should be part of the equation. But remoteness is not just a factor on its own. It is more of an index which takes the issues mentioned above into account, specifically geographical remoteness and social remoteness (or contact, even with other villages and cities in the same language and ethnic communities).
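To make the idea of an index concrete, here is a rough sketch of how geographic and social remoteness might be combined into a single score that could then be plotted. Everything in it is an assumption made for illustration: the two inputs (travel time to the nearest market and days of outside contact per year), the bounds, and the equal weighting are hypothetical and do not come from any published index.

```python
def normalize(value, lowest, highest):
    """Scale value to the range 0..1, clamping values outside the bounds."""
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    return min(1.0, max(0.0, (value - lowest) / (highest - lowest)))


def remoteness_index(travel_hours, contact_days_per_year,
                     max_travel_hours=48.0, geo_weight=0.5):
    """Combine geographic and social remoteness into one score in 0..1.

    All inputs, bounds, and weights here are hypothetical.
    """
    # Geographic remoteness: longer travel time means more remote.
    geographic = normalize(travel_hours, 0.0, max_travel_hours)
    # Social remoteness: more days of contact per year means less remote.
    social = 1.0 - normalize(contact_days_per_year, 0.0, 365.0)
    return geo_weight * geographic + (1.0 - geo_weight) * social


# A village in a mountain pass: a day's travel away, but daily contact.
print(remoteness_index(travel_hours=24, contact_days_per_year=365))  # 0.25
```

The example shows why a single factor misleads. The village is geographically half a scale away, yet constant market traffic pulls its overall score down.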
I am not currently aware of any index, much less a project which plots this index to a geographical area. However, I have found some previous work worth mentioning which might be related and relevant.
Modeling Language Diffusion With ArcGIS
There is an interesting paper and project on modeling language diffusion with ArcGIS. It was prepared for Worldmap.org by Christopher Deckert in 2004 and presented at the 24th ESRI users conference.
Remote Areas of the World
The magazine New Scientist has an article from April 2009 about the most remote places in the world. It has several maps and abstractions showing how remote (with reference to travel time) places in the world are. The following maps come from the New Scientist article.
Map showing the accessibility from one point to another.
Detail of roads in west Africa
Map showing the remoteness of the Tibetan Plateau
The ASGC Remoteness Structure
Another promising resource I found is the ASGC Remoteness Structure which Australia has developed to show how remote parts of Australia are. There is a series of papers explaining the methods behind the algorithms used and the purpose of the study. One of the outputs was the map below.
Australia Remoteness Map
The Territoriality of Public Health Governance in Mexico
The last resource I am going to mention here is The Territoriality of Public Health Governance in Mexico, a study which plots the remoteness of health care in Mexico.
This paper is motivated by an experience in collecting, analyzing, and then redeploying (sharing while making relevant to other corporate SIL functions) corporate intellectual assets. These assets are relevant both to SIL products and services and to corporate processes. This paper attempts to document some of the current challenges presented to the SIL staff person as well as present some items for consideration in overcoming these challenges.
The Context
In preparation for the Me’phaa Language Documentation Project (Mexico), partially sponsored by the NSF, our team has done some research related to GIS data and mapping the geographical distribution of the languages being investigated. This research has involved contacting the Ethnologue cartographers Ireene Tucker and Matt Benjamin. Both have been very helpful, providing the Ethnologue’s data points for inhabited places and the polygons (shape files) showing the distribution of the languages being investigated. It is our team’s hope that through our research and collaboration with the Ethnologue department we might improve the geographical accuracy of Ethnologue maps. In addition, in the event that our research results in a change to the ISO 639-3 codes, such as the addition or combination of languages in the code set, we hope to be able to provide the GIS data relevant to those changes. However, we realize that neither the ISO 639-3 registrar nor the standard keeps track of language points or language area polygons. This is a function of the Ethnologue, not the ISO 639-3 standard.
Some research questions
To reach these collaborative objectives at an academic level of quality we have had to ask several questions:
If an SIL staff researcher (or non-SIL staff researcher) has new GIS data, how do they submit that data to SIL? Then once it is submitted to SIL, how does the Ethnologue editorial team access and use the data?
If a researcher wants to obtain GIS data from SIL, how do they go about getting that data?
When that researcher wants to update the data that SIL has how do they go about submitting these edits to SIL?
How does SIL process and track the edits to the map and GIS data? Are these edits referenced to a research document? Yesterday’s polygons might have been accurate yesterday, and new shapes may reflect language shift issues. How is this change reflected to the end user of the polygons?
How are the sources for the maps tracked, and how do we, as academics, cite these data sources? (We could cite the Ethnologue but the Ethnologue is not always original research. As academics we are interested in and concerned with the Ethnologue’s data sources. These sources are not just the linguistic facts but also the place names, dialect or language variant names, latitude, longitude, altitude, datum, epoch and sources.)1
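The kind of record these questions call for can be sketched as a small data structure which carries its own source and edit history. This is a hypothetical layout, not a description of how SIL or the Ethnologue actually stores data. The class names, field names, and sample values are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Revision:
    date: str    # ISO 8601 date of the edit
    reason: str  # e.g. "correction" or "language shift"
    source: str  # the research document the edit is based on


@dataclass
class PlaceRecord:
    """One inhabited place, with the details a citation would need."""
    name: str
    variant_name: str  # dialect or language variant spoken there
    latitude: float
    longitude: float
    datum: str
    epoch: str
    source: str
    altitude_m: Optional[float] = None
    revisions: List[Revision] = field(default_factory=list)

    def cite(self) -> str:
        """Return a short citation string for this data point."""
        return (f"{self.name} ({self.latitude}, {self.longitude}; "
                f"{self.datum}). Source: {self.source}.")


# Invented sample values, not real survey data.
place = PlaceRecord(
    name="Example Village", variant_name="Example Variant",
    latitude=17.0, longitude=-99.0, datum="WGS 84", epoch="2011",
    source="Hypothetical field survey 2011",
)
place.revisions.append(
    Revision("2011-03-08", "correction", "Hypothetical field report")
)
print(place.cite())
```

Keeping the reason for each revision alongside its source is what would let an end user tell a corrected error apart from a real change on the ground.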
Because I am an SIL staff researcher, and a person familiar with (some of the) SIL business processes, these questions have led me to ask some questions about SIL corporate processes.
Does SIL collect, track, curate, store, and otherwise handle GIS data related to its language projects and treat this data as valuable intellectual property as it does other kinds of intellectual property?2
Are SIL International's corporate data systems prepared to exchange data with field teams and other researchers or communities?
Does SIL manage and deploy this data? Or is that solely the responsibility of the Ethnologue under its business department (an organizational unit within SIL International)?
The Current Process in SIL of creating Ethnologue maps
As I looked for ways to share and improve language data, and to verify the sources for the data used to create SIL’s maps, I learned some very interesting things, mostly about the business model which is employed to create the maps used in the Ethnologue, but also about map and GIS data in general.
Maps are made up of layers, with a certain kind of detail applied on each layer. So the rivers might be in one layer, the county borders in another layer, the national borders in another layer, etc.
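The layer idea can be sketched as a dictionary of named layers from which any one map draws only a selection. The layer names and contents are placeholders standing in for real geometry, and `compose_map` is a hypothetical helper.

```python
# Each layer holds one kind of detail; the strings stand in for real geometry.
layers = {
    "rivers": ["river A", "river B"],
    "county_borders": ["county line 1"],
    "national_borders": ["border X"],
    "language_areas": ["polygon for language 1"],
}


def compose_map(all_layers, selected):
    """Return the chosen layers, in drawing order, as a simple 'map'."""
    missing = [name for name in selected if name not in all_layers]
    if missing:
        raise KeyError(f"unknown layers: {missing}")
    return [(name, all_layers[name]) for name in selected]


# Two different products drawn from the same underlying data.
language_map = compose_map(layers, ["national_borders", "language_areas"])
river_map = compose_map(layers, ["national_borders", "rivers"])
print([name for name, _ in language_map])
```

Both maps come from the same `layers` data, which is why access to the data, and not only to a finished image, matters for reuse.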
All this data does not make up a map. A map is a selection of layers presented in an image. A map is a product, not a data set. In a sense, a map is a visual analysis of data, a selection of sets of details. If a researcher wants to reuse that data or to verify that it is accurate, then the data, not just the analysis, needs to be accessible, usable, and citable. For the most part this was not possible with the Ethnologue maps. Let me generally describe the data gathering and analysis process. This process is roughly approximated in the diagram below and may be somewhat simplified from what actually takes place.
SIL GIS Data Processes
What this process roughly looks like is:
A researcher does some sort of linguistic investigation and collects location and place data about where speakers of minority languages live.
Name and approximate place data would be passed on to appropriate administrators in the form of reports. The data might also be published in a journal article or some other such academic venue.
Finally a conversation would occur with SIL cartographers, working for the Ethnologue for a specific area of the world.
Cartographers would look for the place names provided by the researchers and then find the place names on GMI’s dataset of places in the world. There are two issues which present themselves with this stage of the communication flow:
Some of the coordinates in the GMI data set are rounded, and today, with GPS technology, more accurate coordinates can be found.
The next stage in the flow of data is for the cartographers to take the data they have gleaned from their conversations and to create shape files (polygons) out of it.3
These shape files are then loaded together and produced into maps, which become part of a final publication like the Ethnologue.
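The effect of rounded coordinates, mentioned in the process above, can be estimated with a short calculation. The sketch below uses the standard haversine formula to measure how far a point moves when its coordinates are rounded. The GPS reading is invented, and how the GMI data set actually rounds its coordinates is not something I know, so the number of decimal places is only a parameter.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rounding_error_km(lat, lon, decimals):
    """Distance between a point and the same point with rounded coordinates."""
    return haversine_km(lat, lon, round(lat, decimals), round(lon, decimals))


# Hypothetical GPS reading, not a real village.
gps_lat, gps_lon = 17.04567, -98.95432
for decimals in (0, 1, 2):
    error = rounding_error_km(gps_lat, gps_lon, decimals)
    print(decimals, "decimal places:", round(error, 2), "km")
```

Since one degree of latitude is roughly 111 km, rounding to whole degrees can move a village by tens of kilometres, easily enough to place it in a neighboring language area.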
In regards to the collection of GIS data concerning minority language use, the fundamental question being asked is “How do I create an accurate map for an SIL product?” and not “How do I enable people to visualize language-related data on geographical overlays and thereby foster collaboration among interested parties?” In that sense, SIL runs a map-making operation which is product centric rather than service and sharing centric. Now, SIL does enable its maps to be shared (for a price, through GMI), and one can hire an SIL cartographer to create custom maps. So this might be considered service centric at a different level. However, this is not the same level of data sharing and enabling that, say, Google Maps or LL-Map offers its users for sharing and using GIS data. The saddest part is that this affects SIL’s efficiency, because SIL staff researchers cannot readily collaborate on the maintenance and use of GIS data.
Is it a current fault in corporate information structures that this data (GIS data) is not considered a corporate asset?
The current organizational structures prevent internal researchers from using cartographers without cost. This cost is restrictive both to field researchers and to corporate publishing. But more to the point, the service being offered is not really what linguists want or need. What is truly needed is a method for linguists to interact with the data they are providing and exchanging, and to create their own maps which tell the stories they are trying to convey. Then, if the Ethnologue presents data offered through such an interactive service and platform, knowledge provided from fieldwork can be appropriately cited. As for the results from the language documentation project in Me'phaa, these can be viewed in SIL Mexico's electronic working paper series, particularly Las Conexiones Externas e Internas.
Notes
↑1 It might appear that geographers, cartographers and GIS practitioners do not generally cite their data. (Hoch and Hayes 2010 p.23-24)
↑2 This would assume that SIL International has a corporate value for valuing intellectual property. Intellectual property could be seen as either an asset or a liability.
↑3 This seems to be common practice for language cartographers as of 2006.
The Ethnologue, as an academic book, is somewhat of a straw man in linguistics. Many people who write grants for language documentation projects (generally on under-described or endangered languages) will cite the Ethnologue and some other resources, or the lack of resources. These efforts to seek funding are usually efforts to get more language data. The rationale for this is twofold:
So little is known that we do not know if the Ethnologue is correct.
There is a conflict between other published sources and the Ethnologue.
In both cases the Ethnologue is not taken as an authoritative and accurate source (in an earnest sense, though it is certainly a citable source). In some of the courses in the graduate-level linguistics program I am in, citing data from the Ethnologue is verboten. Nor is the Ethnologue cited only by grant writers or in reference to language documentation efforts. It is cited by many others, like:
persons looking at population statistics
those looking at linguistic diversity
those looking to define what a language is
those looking to organize their language data (though these people should probably now be citing ISO 639-3 rather than the Ethnologue)
typologists and comparativists, and typological works like Multitree or Wichmann et al.’s Causes and consequences of linguistic phylogenetic reticulation.
However, it is this perspective on the Ethnologue, where academics cite it and then use the citation to justify proving it (the Ethnologue) wrong or inaccurate, which makes me say that it is perceived as a straw man. The Ethnologue has a long history. Both Simons and Campbell point out that the Ethnologue did not start out as a linguistics publication. The book was not always published by SIL International; it was formerly published by Wycliffe Bible Translators. It is interesting how a “non-academic” book has such weight in academia, and though some despise its origin, it is still a highly cited resource.
Additionally, the Ethnologue does not cite sources for its language tree associations or say where it gets the data in its entries.1 It does have citations for the intro to the book but not for the material presented under each language. This is uncommon for language atlases and language reference books. WALS and the Atlas of the World’s Languages both cite sources.
In terms of academic merit, the Ethnologue does leave itself open to criticism. While it is understandable that some details might be left out for the sake of printing a hardbound copy, there is no excuse for not providing this data to the viewers of the entries on-line or in non-paper mediums. Just because the Ethnologue is open to criticism in this way does not mean that it is not edited in earnest or that it is not an award-winning publication. As a publication it looks to enhance its coverage to include language vitality and EGIDS metrics for the languages it lists. It is also my understanding that in every editorial cycle multiple thousands of edits occur and a great deal of effort goes into keeping the book accurate. However, for an end user, edits made because of greater, more accurate knowledge about a reported language situation are not distinguished from edits made to correct errors. The question will then arise: What was the purpose of the edit? Was it misinformation before, or do we now just have a more accurate, time-sensitive snapshot?
SEO for standard websites is pretty straightforward. I happen to be working on a website redesign (in Drupal) which presents linguistic resources, both published and unpublished. I recently came across two specialized SEO options which are useful:
This means implementing the OAI-PMH protocol so that OLAC can harvest it.
I am not sure how this is done exactly… but here is the link: http://www.language-archives.org/.
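As a rough sketch of the harvester's side of OAI-PMH, a harvest is an ordinary HTTP request with a `verb` parameter sent to the repository's base URL. The snippet below only builds such a request URL. The base URL is a placeholder, `list_records_url` is a hypothetical helper, and the `olac` metadata prefix is my understanding of what an OLAC harvester asks for, so check it against the OLAC documentation.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="olac", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set of records.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# The base URL is a placeholder, not a real repository endpoint.
print(list_records_url("https://archive.example.org/oai"))
```

Implementing the protocol on the website means answering requests like this one (and the other verbs, such as `Identify` and `GetRecord`) with XML in the requested metadata format.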