Interoperability of online dictionary data: 
 A test case using WordPress as a CMS

Linked data is an effort to enrich applications, and thereby people’s lives, with structured knowledge. At its core, this structure is created through human interaction. The challenge for consumers of linked data is to convince holders of unstructured data to structure it into actionable, manipulable knowledge.

Speech communities will always be the greatest generators of linguistic data, and therefore also the greatest holders of linguistic knowledge. The challenge for researchers is to capture that data and structure it. If researchers want to study the data these communities create, then the onus is on the researchers to provide the communities with tools for tasks that the communities themselves find useful and usable (Beaird 2010[ref 1]). In return, these tools provide a growing body of data for researchers and community members. A web 2.0 design premise makes this possible.

In sub-fields of linguistics such as language documentation, where language research and language revitalization often coexist (Rice 2009[ref 2]), the presence of online dictionaries can facilitate the documentation of a language by granting the community the power to collect the corpus (Elphick & Sohn 2009[ref 3]). Community input has always been needed for the creation of dictionaries with a broad scope of terms (Moe 2001[ref 4], 2007[ref 5]). In the digital world of web 2.0 we call this “crowdsourcing”. Perhaps the best example of successful crowdsourcing in the German-English online translation community is dict.cc (Hemetsberger 2011[ref 6]). Dict.cc started in 2002 as a web-based German-English translation service. Due to the community involvement and the level of interactivity reached, the dataset available to the online service soon contained more matched pairs of lexical items than the Technical University of Munich’s Project Leo (LEO GmbH 2011[ref 7]).

SIL International has developed Pathway (SIL International 2011a[ref 8]), a tool for taking data in the Lexicon Interchange Format (LIFT 2009[ref 9]) and converting it to a variety of output formats. One of the available options is an XHTML output with semantically labeled Cascading Style Sheet classes (Albright 2011[ref 10]). The presentation of the content can then be adjusted for both print and online formats through the CSS attributes.

The Pathway development team has also released a plug-in for WordPress, a popular open source content management system for deploying websites. This plug-in allows the XHTML data to be imported into the WordPress CMS. Researchers who choose to use tools like FLEx (SIL International 2011b[ref 11]) can now export their data to a database-driven website. WordPress can also be deployed and managed by language communities directly, as part of a community-based language documentation effort. The power of a database-driven website is that it allows for the presentation of the researcher’s work alongside active submissions by the community. Due to the semantic classes in the XHTML markup, the dictionary format functions much like a Microformat (Microformat Community 2011a[ref 12]) for dictionary data.

Although different from more formal RDF approaches, Microformats still express statements as two objects linked by a verb, much like the subject, predicate, and object of an RDF triple. According to the Microformat Community (2011b[ref 13]), Microformat syntax can be converted to RDF via a number of mechanisms, including:

  • Microformats that have XHTML Meta Data Profiles (Çelik 2003[ref 14]) can be used directly with the RDF-compatible URIs for microformats terminology provided by those profiles.
  • With the help of the GRDDL mechanism (W3C 2007[ref 15]), it is possible to view microformats as domain-specific RDF serializations.
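
To make the idea of reading class-annotated markup as statements concrete, here is a minimal sketch in Python. It is not GRDDL and not the conversion used by any existing tool; it only shows how an entry, a semantic class, and the marked-up text could be read as a subject, a verb, and an object. The sample markup, the class names (entry, headword, partofspeech, definition) and the function name extract_triples are assumptions made for illustration and are not taken from the proposed standard.

```python
from html.parser import HTMLParser

# Hypothetical sample entry; the class names are illustrative assumptions,
# not the classes defined by the proposed Dictionary XHTML standard.
SAMPLE = """
<div class="entry" id="e1">
  <span class="headword">casa</span>
  <span class="partofspeech">n</span>
  <span class="definition">house</span>
</div>
"""


class EntryTriples(HTMLParser):
    """Collect (entry id, class name, text) tuples from annotated spans."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None    # id of the entry currently open
        self._predicate = None  # class of the span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class")
        if tag == "div" and css_class == "entry":
            self._subject = attrs.get("id")
        elif tag == "span" and css_class and self._subject:
            self._predicate = css_class

    def handle_data(self, data):
        text = data.strip()
        if self._predicate and text:
            self.triples.append((self._subject, self._predicate, text))

    def handle_endtag(self, tag):
        if tag == "span":
            self._predicate = None
        elif tag == "div":
            self._subject = None


def extract_triples(xhtml):
    """Return the list of triples found in a string of annotated markup."""
    parser = EntryTriples()
    parser.feed(xhtml)
    parser.close()
    return parser.triples
```

Calling extract_triples(SAMPLE) yields one tuple per annotated span, each tied to the id of its entry. A real conversion to RDF would still need to map the class names to URIs, which is the role a profile or a GRDDL transformation would play.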

This means that a community’s online dictionary can be used as an RDF data set, or as an ontology that is domain specific to that particular URI. Hypothetically, this should allow for cross-linguistic and typological research on lexical items. With several bilingual dictionaries online, it should be possible to send a single query to all of them and receive a variety of answers. This kind of cross-linguistic corpus creation has its place in linguistics (Parker 2006[ref 16]). Notwithstanding this great potential for cross-linguistic research, two issues stand in the way of using these data sets, or dictionaries, today:

  1. Discovery of the resources (online dictionaries)
  2. The harvestability of the token data (word sets or pairings)

The goal of this test case is to present a solution to both of these issues. When it comes to discovering online dictionaries, most community users will rely on search engine results to find the front end of the dictionary web site. Researchers running machine-based queries against several dictionaries at once, however, need another discovery tool. Thankfully, OLAC (Simons & Bird 2011[ref 17]) already runs an aggregation service for the discovery of language and linguistics resources. OLAC uses the OAI-PMH protocol, a pull-based system for resource discovery. The WordPress plug-in mentioned above needs to be modified so that it describes the dictionary site as a language resource using the OAI-PMH protocol. This modification would allow the site to appear in the aggregation results of the OAI harvester. Because the harvester is pull-based, the plug-in also needs to provide an interface through which the dictionary site administrator can make the site known to the OAI harvester.
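
Because OAI-PMH is pull-based, a harvester discovers a site by sending it HTTP requests that carry a verb parameter. The sketch below, in Python, only builds the request URLs a harvester might send; it does not contact any server. The six verbs are those defined by OAI-PMH 2.0, and olac is the metadata prefix OLAC harvesters ask for. The base URL http://dictionary.example.org/oai and the function name oai_request_url are assumptions for illustration; the modified plug-in would have to decide where its endpoint actually lives.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL of an OAI-PMH request against a repository endpoint.

    base_url is a hypothetical endpoint exposed by the dictionary site.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical endpoint for a community dictionary site.
BASE = "http://dictionary.example.org/oai"

identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="olac")
```

The first URL asks the site to describe itself as a repository; the second asks for all of its records in OLAC metadata. The plug-in modification proposed here amounts to answering requests like these with well-formed XML responses.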

The harvestability of token data has two aspects which deserve attention. Only the first is directly within the scope of this test case.

  1. Refinement of the proposed XHTML Dictionary Standard to comply with the design patterns outlined by the Microformats community.
  2. The implementation of an API using webhooks (Lindsay 2011[ref 18] ) to provide actions to machine based clients.
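
The second aspect is outside the scope of this test case, but a rough sketch may help show what a webhook amounts to: when something happens on the dictionary site, the site sends an HTTP POST to callback URLs that machine-based clients have registered. The Python sketch below only prepares those requests and does not send them. The event name entry.added, the payload fields, the subscriber URL, and the function name build_webhook_requests are all hypothetical; no such API exists in the plug-in today.

```python
import json
from urllib.request import Request


def build_webhook_requests(subscriber_urls, event, entry):
    """Prepare one JSON POST request per registered callback URL.

    The requests are only constructed here; sending them (for example
    with urllib.request.urlopen) is left to the caller.
    """
    body = json.dumps({"event": event, "entry": entry}).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=utf-8"}
    return [
        Request(url, data=body, headers=headers, method="POST")
        for url in subscriber_urls
    ]


# Hypothetical subscriber and payload, for illustration only.
requests_to_send = build_webhook_requests(
    ["http://harvester.example.org/hook"],
    "entry.added",
    {"headword": "casa", "definition": "house"},
)
```

The design point is that the dictionary site pushes changes to interested clients, which complements the pull-based discovery that OAI-PMH provides.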

The currently proposed XHTML Dictionary Standard does not follow the style of recommendations made by the Microformats community. I imagine there are good reasons for this; one that I know of is that it was designed with print layout in mind first. One glaring difference between the proposed XHTML Dictionary Standard and a possible Microformat-compliant recommendation, “hDictionary”, is the proposed standard’s use of <span> elements. These elements are used for head words and also for the definitions of those head words. However, XHTML already has an element, <dt>, designated for definition terms and an element, <dd>, for the definitions of those terms. Microformat design philosophy calls for using the elements of the W3C XHTML standard before resorting to semantic classes. However, the designers of XHTML did not explicitly account for issues like synonyms, antonyms, parts of speech, or the general structure that linguists and lexicographers track in detail in their dictionaries. It would appear that some of these lexicography issues can be resolved with semantic CSS classes.
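
As an illustration of combining native elements with semantic classes, the Python sketch below parses a definition list in which each dt holds a head word and each dd holds either a definition or, when it carries a class, some other lexicographic field. The markup is a guess at what an “hDictionary” pattern could look like; the class names (hdictionary, synonym) and the rule that an unclassed dd counts as a definition are assumptions, not part of any published recommendation.

```python
from html.parser import HTMLParser

# Hypothetical "hDictionary"-style markup; class names are assumptions.
SAMPLE_DL = """
<dl class="hdictionary">
  <dt>casa</dt>
  <dd>house</dd>
  <dd class="synonym">hogar</dd>
  <dt>agua</dt>
  <dd>water</dd>
</dl>
"""


class DefinitionListParser(HTMLParser):
    """Pair each dt head word with the dd fields that follow it."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._current = None   # "dt", "dd", or None
        self._dd_class = None  # field name for the dd currently open

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.entries.append({"headword": "", "fields": []})
            self._current = "dt"
        elif tag == "dd" and self.entries:
            self._current = "dd"
            # An unclassed dd keeps its native meaning: a definition.
            self._dd_class = dict(attrs).get("class") or "definition"

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "dt":
            self.entries[-1]["headword"] += text
        elif self._current == "dd":
            self.entries[-1]["fields"].append((self._dd_class, text))

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._current = None


def parse_definition_list(xhtml):
    """Return a list of entries, each with a head word and its fields."""
    parser = DefinitionListParser()
    parser.feed(xhtml)
    parser.close()
    return parser.entries
```

Here the elements carry the basic term and definition relationship, and a class is needed only where XHTML has no element of its own, as with the synonym.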

By giving minority language communities tools which can help them create online dictionaries, researchers can seize the opportunity to help those language communities structure their data in formats that enrich the global community as well as their individual communities. I propose that this WordPress plug-in be modified to facilitate the discovery of online dictionaries and that the proposed XHTML Dictionary Standard be reworked to conform to Microformat design recommendations.


Bibliography

  1. Beaird, Jason. 2010. The principles of beautiful web design, 2nd edn. Collingwood, Vic.: SitePoint Pty. Ltd.
  2. Rice, Keren. 2009. Must There Be Two Solitudes? Language Activists and Linguists Working Together. In Jon Reyhner & Louise Lockard (eds.), Indigenous Language Revitalization: Encouragement, Guidance & Lessons Learned, 37-60. Flagstaff, Arizona: Northern Arizona University.
  3. Elphick, Ester Timbancaya & Virginia Howard Sohn. 2009. Documenting and Preserving Cuyonon. Paper presented at Eleventh International Conference on Austronesian Linguistics, Aussois, France, June.
  4. Moe, Ronald. 2001. Lexicography and mass production. Notes on Linguistics 4.1: 150-56.
  5. Moe, Ronald. 2007. Dictionary Development Program. SIL Forum for Language Fieldwork 2007.003: 1-12.
  6. Hemetsberger, Paul. 2011. Dict.cc English-German Dictionary. http://www.dict.cc/?s=about [Accessed: 13 August 2011]
  7. LEO GmbH. 2011. LEO: An online service of LEO GmbH. http://dict.leo.org [Accessed: 13 August 2011]
  8. SIL International. 2011a. Pathway. Computer program. http://pathway.sil.org/ [Accessed: 13 August 2011]
  9. LIFT. 2009. Lexical Interchange Format Standard. http://code.google.com/p/lift-standard/ [Accessed: 13 August 2011]
  10. Albright, Jim. 2011. Dictionary XHTML Proposed Standard. http://pathway.sil.org/features/standards/dictionary-xhtml-proposed-standard/ [Accessed: 13 August 2011]
  11. SIL International. 2011b. SIL FieldWorks Language Explorer. Computer program. http://fieldworks.sil.org/ [Accessed: 13 August 2011]
  12. Microformat Community. 2011a. Microformats Wiki. http://microformats.org/ [Accessed: 13 August 2011]
  13. Microformat Community. 2011b. Microformat FAQs relating to RDF. http://microformats.org/wiki/faqs-for-rdf [Accessed: 13 August 2011]
  14. Çelik, Tantek. 2003. XHTML Meta Data Profiles. http://gmpg.org/xmdp/ [Accessed: 13 August 2011]
  15. W3C. 2007. Gleaning Resource Descriptions from Dialects of Languages (GRDDL). http://www.w3.org/TR/grddl/ [Accessed: 13 August 2011]
  16. Parker, Steve G. 2006. A cross-linguistic corpus of forms meaning ‘yes’. Linguistic Discovery 4.1: 1-34.
  17. Simons, Gary F. & Steven Bird. 2011. OLAC: Accessing the World’s Language Resources. Paper presented at Poster session on Metadata in Language Documentation and Description, Annual Meeting of the Linguistic Society of America, Pittsburgh, 6–9 January 2011.
  18. Lindsay, Jeff. 2011. WebHooks. http://www.webhooks.org/ [Accessed: 13 August 2011]
