A Story Breeds A Story

Posted on November 22, 2011 by Hugh Paterson III

While I was in Malaysia, I had the honor to meet and talk quite a bit with Professor Emeritus Howard McKaughan. We talked about his linguistics-based work in Mexico, the Philippines, and Malaysia. He can tell stories, interesting stories.

Howard - Story Telling

There is something unique about his generation of Americans (currently in their 80s and 90s): their ability to craft and tell stories. I feel that this is a cultural point I don’t have. It could be because I am third culture, or because I talk too much about the macro-details, or it might simply be because I am long-winded.

Language maps like heat maps

Posted on September 18, 2011 by Hugh Paterson III

There is a myriad of difficulties in overlaying language data with geographical data. But it has been done and can be done. While I was working in México on a language documentation project, I learned that some of the language mixing (not quite diglossia, rather two people groups with different languages living in the same spaces) was due to geographical and economic factors pulling them into the same geographic locations. In the particular case I am thinking of, there was a mountain pass and a valley on the way to the major center of trade. In this sort of context the interesting things are displayed not when a polygon is drawn showing a territorial overlay of where the speakers of various languages live, but when something is drawn showing the density or dispersion of speakers relative to the general population. Some of the most detailed (in terms of global perspective) language maps can be found in the Ethnologue [1].

Western Central Mexico from the Ethnologue

However, as I was working on the language documentation project I found out how much effort actually goes into that sort of map. ArcGIS, the software used to create the maps, can not auto-generate a polygon a certain distance around a combined set of given points. A set of points can be selected and each point can get a 5 mile radius. What this means is that each polygon has to be hand drawn.

The sort of graphical overlay that is used in the Ethnologue does not show the density of speakers of a language in an area relative to the total population (in the Ethnologue’s defense, I am not sure it is supposed to). For instance, if I wanted to know “What is the density of speakers in the Me’phaa area of México relative to speakers of other languages?”, a map answering that would show me some dispersion and, by implication, the peopling of the area. This sort of geographical overlay may be closer to displaying social networks than bilingualism or diglossia. There might be some bilinguals or some average level of bilingualism there, but the heat map method of plotting still looks at the density of speakers in an area. A similar map might be created of New York City, where each language is given a color based on its distribution density in the area. Additionally, these sorts of data overlays are probably more likely to lend insights into language attrition patterns or language speaker migration patterns.

Also, these hand drawn polygons change (a little) from edition to edition. Because the data used to create the polygons is not referenced (cited), it is hard to tell if a change is keeping pace with language attrition and/or population movement, or if the changes are due to a better linguistic understanding of a particular area. When looking at the large area maps in the Ethnologue, it is hard to tell if the red dots represent the “traditional” language area (or geographical center thereof) or the current geographical center of the speaking area. Either way, the plotting functions as if it were a heat map showing the diversity of languages over a geographical area.
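To make the idea of "density of speakers relative to the total population" a bit more concrete, here is a minimal Python sketch. It is not how the Ethnologue maps are produced. It assumes hypothetical village records (latitude, longitude, number of speakers, total population), bins them into grid cells of an arbitrary size, and computes the share of speakers in each cell. The function name, the cell size and the sample numbers are all made up for illustration.

```python
from collections import defaultdict


def speaker_share_by_cell(records, cell_size=0.5):
    """Return {grid cell: share of the population speaking the language}.

    records: iterable of (lat, lon, speakers, total_population) tuples.
    cell_size: width of a square grid cell in decimal degrees (arbitrary).
    """
    speakers = defaultdict(int)
    population = defaultdict(int)
    for lat, lon, n_speakers, n_total in records:
        # Floor division assigns every point to the grid cell that contains it.
        cell = (int(lat // cell_size), int(lon // cell_size))
        speakers[cell] += n_speakers
        population[cell] += n_total
    # Skip empty cells so we never divide by zero.
    return {
        cell: speakers[cell] / population[cell]
        for cell in population
        if population[cell] > 0
    }


# Hypothetical villages: (lat, lon, speakers of the language, total population)
villages = [
    (17.10, -98.90, 800, 1000),
    (17.20, -98.80, 100, 1000),
    (17.60, -98.90, 50, 500),
]
shares = speaker_share_by_cell(villages)
```

Each value in `shares` is a number between 0 and 1, which is the kind of value a heat map layer could turn into a color. Unlike a hand drawn polygon, it changes whenever the underlying census or survey numbers change.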

Americas Map from the Ethnologue

gHeat

I am generally on the lookout for web apps and APIs which can be used to overlay data and bring new insights to situations through graphical representations. I recently found a tool, called gHeat, for overlaying data on Google Maps. It creates heat maps given data from another source. The tool was brought to my attention by Ben O’Steen, who modified gHeat to display some prices for student properties in the UK [4]. My initial thought was: “Wow, how can we do language maps like this?”

Student Property Heat Map

I still think that language-based heat maps could give language workers worldwide access to visualizations of data that would really add clarity to the language vitality situation.

References
[1] Lewis, M. Paul (ed.). 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International.
[2] Map of Languages in Western Mexico in the Ethnologue. [Accessed: 9 September 2011] http://www.ethnologue.com/show_map.asp?name=MX&seq=30
[3] Map of Languages in the Americas in the Ethnologue. [Accessed: 9 September 2011] http://www.ethnologue.com/show_map.asp?name=Americas&seq=10
[4] Ben O’Steen. 2011. Student Property Heatmap. Random Hacks: Hacks, code and other things. [Accessed: 2 September 2011] http://benosteen.wordpress.com/2011/07/26/student-property-heatmap

Metadata Magic

Posted on August 10, 2011 by Hugh Paterson III

The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository, the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachment of the digital item to the proper bitstream. Totally Awesome.

However, one of the challenges is that just because the metadata is curated, collected and properly filed, it does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft-Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time), but later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004 (James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards).

So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata into the files before upload, so that when the files arrive in DSpace they will have the correct metadata.

The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so, as they are both brain children of the geeks at Adobe.) However, what is needed in this particular setup is a two-way street with a check in between. We would need to overwrite what was there with the data we want there.
However, as of 2009, there were no tools in AIR which could manipulate EXIF data (for photos).
But it does look like the situation is more hopeful for working with audio metadata.

One way around the limitations of JavaScript itself might be to use JavaScript to call a command-line tool or execute a Python, Perl, or shell script, or even use a library. There are some technical challenges which need to be bridged when using these kinds of tools in a cross-platform environment (anything from flavors of Linux to OS X 10.4-10.7 and Windows XP – Current). This is mostly because of the various ways of implementing scripts on different platforms.

The technical challenge is that Adobe AIR is basically a JavaScript environment. As such, there are certain technical challenges around implementation of command-line tools like Xpdf from fooLabs and Coherent PDF Tools or Phil Harvey’s ExifTool, Exifv2, pdftk, or even TagLib. One of the things that Adobe AIR can do is call an executable via something called ActionScript. There are even examples of how to do this with PDF metadata. This method uses PurePDF, a complete ActionScript PDF library. ActionScript is powerful in and of itself: it can be used to call the XMP metadata of a PDF, though one could also use it to call on Java to do the same “work”.
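As a rough illustration of the "call a command-line tool from a script" route, here is a minimal Python sketch that builds an ExifTool command for writing curated metadata into a file before upload. It is only a sketch under assumptions: it is not part of the AIR app, the function names and the sample file name are hypothetical, and it assumes ExifTool is installed and on the PATH. The `-TAG=VALUE` form is ExifTool's usual syntax for writing a tag.

```python
import shutil
import subprocess


def build_exiftool_command(path, metadata):
    """Build the argument list for writing metadata tags with ExifTool.

    metadata: dict of tag name -> value, e.g. {"Title": "...", "Author": "..."}.
    """
    args = ["exiftool"]
    for tag, value in metadata.items():
        # ExifTool writes a tag when it is given as -TAG=VALUE.
        args.append(f"-{tag}={value}")
    args.append(path)
    return args


def embed_metadata(path, metadata):
    """Run ExifTool if it is available; return False when it is not installed."""
    if shutil.which("exiftool") is None:
        return False
    # By default ExifTool keeps a backup copy named <file>_original.
    subprocess.run(build_exiftool_command(path, metadata), check=True)
    return True


# Hypothetical values standing in for the curated repository metadata.
command = build_exiftool_command(
    "paper.pdf", {"Title": "Example Title", "Author": "Example Author"}
)
```

Passing the arguments as a list (rather than one shell string) sidesteps most of the quoting differences between Linux, OS X and Windows, which is one of the cross-platform headaches mentioned above.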

Three Lingering Thoughts

Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, it does not mean that the submitters would know how or why to use it. This is not, however, a valid reason to leave the functionality out of a development project. All marketing aside, an archive does have a responsibility to consumers of the digital content that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers do appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, Photoshop, iTunes, Mendeley, Papers, etc.
Another thought is that when one is dealing with large files (over 1 GB), there is a reason for making a “preview” version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see if it is worth downloading the .wav file? It seems that a metadata packager could also create a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
This problem has been addressed in the open source library science world. In fact a nice piece of software does live out there. It is called the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs but it is a solution for some needs of this type.
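The preview idea in the second thought above could be sketched roughly like this in Python. This is an assumption-laden sketch, not something the packager does: the function names, file names, bitrate and pixel size are hypothetical, and it assumes the `ffmpeg` and ImageMagick `convert` commands are installed (newer ImageMagick versions name the command `magick`).

```python
def audio_preview_command(source, preview, bitrate="64k"):
    """Command for a small, lossy listening copy of a large audio file."""
    # ffmpeg: -i names the input, -ac 1 mixes down to mono,
    # -b:a sets the audio bitrate of the output file.
    return ["ffmpeg", "-i", source, "-ac", "1", "-b:a", bitrate, preview]


def image_preview_command(source, preview, max_size=800):
    """Command for a reduced-size viewing copy of a large image."""
    # ImageMagick: -resize WxH fits the image inside the box
    # while keeping its aspect ratio.
    return ["convert", source, "-resize", f"{max_size}x{max_size}", preview]


# Hypothetical file names.
audio_cmd = audio_preview_command("session01.wav", "session01_preview.mp3")
image_cmd = image_preview_command("scan01.tif", "scan01_preview.jpg")
```

Either list can be handed to `subprocess.run(...)` at the point where the packager assembles the upload, so the preview travels alongside the archival file.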

Impossible English Grammar

Posted on July 9, 2011 by Hugh Paterson III

While I was in Mexico, I was walking to the store with a friend, who is also a fellow linguistics student. He was telling me a story. In the course of that story a naturally occurring sentence flowed “out of his mouth”. After he said that sentence I let him finish his thought and I asked him if the sentence was grammatical.

Here is the sentence:

Yesterday, I saw the latin version of one of my friend’s husbands in Soriana.

(Soriana is a chain of stores in Mexico.)

My contention was that the “s” on “husbands” was ungrammatical.

Of course, if the sentence is read:

Yesterday, I saw the latin version of one of my friend’s husband in Soriana.

The sentence sounds awkward. Perhaps it is not a well formed sentence. But is it ungrammatical? What is the violation which makes the sentence sound awkward? Is it the constrained unit [one of my friend’s] which is embedded in another grammatical unit, which is apparently unconstrained [the latin version of…]?

We tried to move the grammatical units around and did not find a satisfying solution.

Yesterday, I saw the latin version of the husband of one of my friends in Soriana.

Yesterday, I saw the latin version of one of my friend’s husband in Soriana.

Yesterday, I saw a man who looked like my friend’s husband in Soriana.

Yesterday, I saw a man who could have passed as the latin version of one of my friend’s husband in Soriana.

Yesterday, I saw a man who could have passed as the latin version of the husband of one of my friends in Soriana.

Yesterday, in Soriana, I saw a latino version of my friend’s husband.

Yesterday, I saw a the latin version of the husband of one of my friends in Soriana.

Yesterday, I saw in Soriana the latin version of the husband of one of my friends.

All this variation in the options for ordering information has led me to ask three questions of English:

How are time, manner, and place naturally ordered in English?
What is the prominent element of information in each option and why?
What are the elements?

Network Language Documentation File Management

Posted on May 4, 2011 by Hugh Paterson III

This post is an open draft! It might be updated at any time.

Meta-data is not just for Archives

Bringing the usefulness of meta-data to the language project workflow
It has recently come to my attention that there is a challenge when considering the need for a network-accessible file management solution during a language documentation project. This comes with my first introduction to linguistic field experience and my first field setting for a language documentation project.

The project I was involved with was documenting four languages in the same language family. The location was in Mexico. We had high-speed Internet and a Local Area Network, and stable electricity (more often than not). The heart of the language communities was a 2-3 hour drive from where we were staying, so we could make trips to different villages in the language community, and there were language consultants coming to us from various villages. Those consultants who came to us were computer literate and were capable of writing in their language. The methods of the documentation project were motivated along the lines of: “we want to know ‘xyz’ so we can write a paper about ‘xyz’, so let’s elicit things about ‘xyz’”. In a sense, the project was product oriented rather than (anthropological) framework oriented.

We had a recording booth. Our consultants could log into a Google-doc and fill out a paradigm. We could run the list of words given to us through the Google-doc to a word processor and create a list to be recorded, give that list to the recording technician, and then produce a recorded list. Our consultants could also create a story, and often did; then we would help them to revise it and record it. We had geo-social data from the Mexican government census. We had geo-spatial data from our own GPS units. During the course of the project massive amounts of data were created in a wide variety of formats. Additionally, in the case of this project language description is happening concurrently with language documentation. The result is that additional data is desired and generated. That is, language documentation and language description feed each other in a symbiotic relationship. Description helps us understand why this language is so important to document and which data to get; documenting it gives us the data for doing analysis to describe the language.

The challenge has been: how do we organize the data in meaningful and useful ways for current work and future work (archiving)? People are evidently doing it, all over the world… maybe I just need to know how they are doing it. In our project there were two opposing needs for the data:

Data organization for archiving.
Data organization for current use in analysis and evaluation of what else to document.

It could be argued that a well planned corpus would eliminate, or reduce, the need for flexibility to decide what else there is to document. This line of thought does have its merits. But flexibility is needed by those people who do not try to implement detailed plans.

Riddles, Poems, and Tangle-Worded Couplets

Posted on April 7, 2011 by Hugh Paterson III

We were sitting around the kitchen table after pizza one night, when the neighbor started to tell some jokes. After a few jokes others around the table started to tell their favorite jokes. Soon the neighbor turned to me and said, “you are up next”. Fear struck my heart.

Using Endnote X4 for Mac

Posted on February 10, 2011 by Hugh Paterson III

One of the most popular citation management software applications among academics is Endnote. Endnote has a long history, is published by a reputable company, and has some pretty cool features. I use it (version X4) primarily because it is the only citation software which integrates natively with the word processor Pages, by Apple, Inc.

(There is other citation management software for OS X which claims integration with Pages, but none of these solutions are endorsed or supported by Apple. Some of the other applications which claim integration with Pages are:

Sente
Bookends
Papers – This is according to Wikipedia, but I own and use Papers 1.9.7 and have not seen how to integrate it with Pages. However, Papers2, released March 8th, 2011, does say that it supports citation integration with Pages.)

The software boasts a bit of flexibility and quite a few useful features. Some of the really useful features I use are below.

Customizing the output style of the bibliographies.
There are several Linguistics Journals with style sheets on Endnote’s Website. Among them are:
Additionally there is a version of the Unified Linguistics Style Sheet available for Endnote. This is available from Manchester UK. http://www.llc.manchester.ac.uk/intranet/ug/useful-links/computing-resources/wordprocessing/. [.ens file]
Looking for PDF files.
Attaching additional meta-data to each citation. (Like ISO 639-3 Language Codes)
Adding additional types of resources like Rutgers Optimality Archive Documents with an ROA number.
Smart Groups of files based on desired criteria.
Integration with Apple’s word processor Pages.
Research Notes section in the citation’s file for creating an annotated bibliography.
Copy out all the selected works, so that they can be pasted as a bibliography in another document.
XML Output of Citation Data
The XML Support of Endnote has not been hailed as the greatest implementation of XML but there are tools out there to work with it.
- http://www.uns.ethz.ch/pub/publications/xml_format
- XSL Transformation stylesheet for EndNote XML.

However, regardless of how many good features I find and use in Endnote there are several things about it which irk me to no end. This is sort of a laundry list of these problematic areas.

~~Can not sort by resource type~~:
For instance if I wanted to sort or create a smart list of all my Book references, or just Journal Articles. This can be done, one just has to create a smart list and then set Reference Type to Contains: “Book Section”. There is not a drop down list of reference types invoked by the user.
~~Can not sort by custom field~~:
I think you can do this in the interface, though it was not obvious how to do it.
Endnote Display Fields
Can not view all the custom fields for a resource type across all resources.
This seems to be limited to eight fields in the sorting viewer at a time.
~~Can not view all entries without content in a specified field.~~
This would be especially nice to create a smart list for this.
No exports of PDFs or exports of PDFs with .ris files.
There is no keyboard short-cut to bring up the Import function (or Export) under File Menu
Does not rename PDFs based on metadata of the resource.
This is possible with Papers and Mendeley. The user has the option to rename the file based on things like Author, Date of publication, etc.
Can not create a smart list based on a constant in the Issue data part.
I have Volume and Issue Data. Some of the citation data pulled in for some items has the issue set as 02, 03, etc. I want to be able to find all the issues which start with a zero so I can remove the zeros. Most stylesheets do not remove the zeros and also do not allow for them.
Items are not selectable based on Issue Data
Can not export PDFs with embedded metadata in the PDF.
Can not open the folder which contains a PDF included in an Endnote Library.
Endnote does not have a way to open the containing folder of a PDF

Close-up of the same menu.
Modifying Resource type does not accept |Language| Subject Language|
~~There is no guide in any of Endnote’s documentation for how to create an export style sheet.~~
This is in the Help menus. I was expecting it on the producer’s website or in a book.
When editing an entry’s meta-data i.e. the author, or the title of a work, pressing TAB does not move the cursor to the next field.
At least sometimes it does not continue to TAB. If I do a new entry as a Journal article, then it will tab till the issue field, but not beyond. It gets stuck.
There is no LAN collaboration or sharing feature for a local network solution.
There is no Cloud based collaborative solution.
There is no way to create a smart group based off of a subset of items in a normal group.
i.e. I want to create a smart group of all the references with a PDF attached but I only want it to pull from the items in a particular group (or set of groups).
Endnote has no option for 'part of group' filter
There is no PDF Preview within the application. The existing Preview is for seeing the current citation in the selected citation style. (Preview of the output.) It would be helpful if there was also a preview pane for viewing the PDF or the attached file.
No PDF Preview internal to Endnote
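One of the gripes above is about issue numbers stored as 02, 03, etc. Outside of Endnote, finding and fixing these is a small regular expression job. Here is a minimal Python sketch, assuming the citation data has already been exported into some list of records. The record structure below is hypothetical and is not Endnote's export format.

```python
import re

# One or more zeros at the start of the value, but only when another digit
# follows, so that a bare "0" is left alone.
LEADING_ZERO = re.compile(r"^0+(?=\d)")


def has_leading_zero(issue):
    """True when the issue value starts with a superfluous zero."""
    return LEADING_ZERO.match(issue.strip()) is not None


def normalize_issue(issue):
    """Strip leading zeros from an issue value such as '02' or '003'."""
    return LEADING_ZERO.sub("", issue.strip())


# Hypothetical records standing in for exported citation data.
records = [
    {"title": "A", "issue": "02"},
    {"title": "B", "issue": "10"},
    {"title": "C", "issue": "003"},
    {"title": "D", "issue": "0"},
]

flagged = [r["title"] for r in records if has_leading_zero(r["issue"])]
cleaned = [normalize_issue(r["issue"]) for r in records]
```

`flagged` is the "smart list" I wanted (every item whose issue starts with a zero), and `cleaned` holds the values as most style sheets expect them.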

Learned or Innate

Posted on July 21, 2010 by Hugh Paterson III

I presented on Jeff Mielke’s The Emergence of Distinctive Features.

The two questions covered in this presentation are:

Are features learned or innate?
Do we have sound patterns from features or do we have features from sound patterns?

Open Source Language Codes Meta-data

Posted on July 10, 2010 by Hugh Paterson III

One of the projects I have been involved with published a paper this week in JIPA. It is a first for me: being published. Being the thoughtful person I am, I was considering how this paper will be categorized by librarians. For the most part papers themselves are not catalogued. Rather, journals are catalogued. In a sense this is reasonable considering all the additional meta-data librarians would have to create in their meta-data tracking systems. However, in today’s world of computer catalogues it is really a shame that a user can’t go to a library catalogue and ask, “What resources are related to German [deu]?” As a language and linguistics researcher I would like to quickly reference all the titles in a library or collection which reference a particular language. The use of the ISO 639-3 standard can and does help with this. OLAC also tries to help with this resource location problem by aggregating the tagged contents of participating libraries. But in our case the paper makes reference to over 15 languages via ISO 639-3 codes. So our paper should have at least those 15 codes in its meta-data entry. Furthermore, there is no way for independent researchers to list their resource in the OLAC aggregation of resources. That is, I can not go to the OLAC website and add my citation and connect it to a particular language code.

There is one more twist which I noticed today too. One of the ISO codes is already out of date. This could be conceived of as a publication error. But even if the ISO had made its change after our paper was published, this issue would still persist.

During the course of the research and publication process of our paper, change request 2009-78 was accepted by the ISO 639-3 Registrar. This is actually a good thing. (I really am pro ISO 639-3.)

Basically, Buhi’non Bikol is now considered a distinct language and has been assigned the code [ubl]. It was formerly considered to be a variety of Albay Bicolano [bhk]. As a result of this change [bhk] has now been retired.

Here is where we use the old code. On page 208 we say:

voiced velar fricative [ɣ]

Aklanon [AKL] (Scheerer 1920, Ryder 1940, de la Cruz & Zorc 1968, Payne 1978, Zorc 1995) (Zorc 1995: 344 considers the sound a velar approximant)

Buhi’non [BHK] (McFarland 1974)

In reality McFarland did not reference the ISO code in 1974. (ISO 639-3 didn’t exist yet!) So the persistent information is that it was the language Buhi’non. I am not so concerned with errata or getting the publication to be corrected. What I want is for people to be able to find this resource when they are looking for it. (And that includes searches which are looking for a resource based on the languages which that resource references.)

The bottom line is that the ISO does change. And when it does change we can start referencing our new publications and data to the current codes. But there are going to be thousands of libraries out there with out-dated language codes referencing older publications. A librarian’s perspective might say that they need to add both the old and the new codes to the card catalogues. This is probably the best way to go about this. But who will notice that the catalogues need to be updated with the new codes? What this change makes me think is that there needs to be an Open Source vehicle where linguists and language researchers can contribute their knowledge about language resources to a community. Then librarians can pull that meta-data from that community. The community needs to be able to vet the meta-data so that the librarians feel like it is credible meta-data. In this way the quality and relevance of meta-data can always be improved upon.
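The "add both the old and the new codes" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the table of retired codes is hypothetical and holds just the one change described in this post (the full change request may name other successor codes as well), and the function names are made up.

```python
# Retired ISO 639-3 code -> codes that replaced it.
# Only the change mentioned in this post is listed (an assumption, not the
# complete registry entry).
RETIRED_CODES = {"bhk": ["ubl"]}


def codes_for_catalogue(cited_codes, retired=RETIRED_CODES):
    """Return the cited codes plus any successors of retired codes.

    The old code is kept, so searches using either the old or the new
    code will still find the resource.
    """
    result = []
    for code in cited_codes:
        code = code.lower()
        for candidate in [code] + retired.get(code, []):
            if candidate not in result:
                result.append(candidate)
    return result


def resource_matches(cited_codes, query_code):
    """True when a search for query_code should find this resource."""
    return query_code.lower() in codes_for_catalogue(cited_codes)


# The two codes from the page 208 excerpt above.
paper_codes = codes_for_catalogue(["AKL", "BHK"])
```

If the retired-codes table lived in a shared, community-vetted place, a catalogue could re-run this kind of expansion whenever the standard changes instead of waiting for someone to notice.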

SSH, Unix commands & RegEx

Posted on June 16, 2010 by Hugh Paterson III

This summer I am sitting in on a computational linguistics course. It is the first instruction I have had about UNIX. Pretty Awesome.
This has required me to do some googling, looking for terminal commands.

This is kind of a sketch of where I have been.

UNIX:
http://www.osxfaq.com/Tutorials/LearningCenter/

SSH:
http://kimmo.suominen.com/docs/ssh/
http://ss64.com/osx/

TERMINAL:
http://homepage.mac.com/rgriff/files/TerminalBasics.pdf

grep:
http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/
http://en.wikipedia.org/wiki/Grep
http://www.computerhope.com/unix/ugrep.htm

Regular Expressions:
http://www.zytrax.com/tech/web/regex.htm
http://www.regular-expressions.info/tutorial.html
http://gnosis.cx/publish/programming/regular_expressions.html

RegEx and Unicode:
One of the issues that I have had with RegEx has been: what is a natural class? i.e. [A-Z], [A-Za-z], [0-9], etc. As a linguist I deal a lot with IPA characters, subscripts, superscripts, Unicode, and diacritics. How am I to define a natural class with these? Can I define a natural class based on the phonology of the language?

So I did some more searching:
http://unicode.org/reports/tr18/
http://unicode.org/reports/tr18/tr18-5.1.html
http://icu-project.org/docs/papers/iuc26_regexp.pdf
http://courses.ischool.berkeley.edu/i256/f06/papers/regexps_tutorial.pdf
http://wapedia.mobi/en/Regular_expression?t=5.
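One way to approach the natural class question is to list the segments of the class myself and let a script build the pattern. Below is a minimal Python sketch using only the standard library. The vowel inventory is hypothetical, and the approach (normalize everything to decomposed form, then match whole segments) is just one option; Python's built-in `re` module does not implement the Unicode property classes described in the Unicode report linked above.

```python
import re
import unicodedata

# The Combining Diacritical Marks block (U+0300 to U+036F).
COMBINING = "\u0300-\u036f"


def natural_class(segments):
    """Compile a pattern that matches any one segment of the class.

    Segments are normalized to NFD (base letter + combining marks) and
    sorted longest first, so a longer segment wins over its own prefix.
    The negative lookahead stops a plain 'a' from matching inside 'a' +
    tilde, which belongs to a different class.
    """
    nfd = sorted(
        (unicodedata.normalize("NFD", s) for s in segments),
        key=len,
        reverse=True,
    )
    body = "|".join(re.escape(s) for s in nfd)
    return re.compile(f"(?:{body})(?![{COMBINING}])")


def find_segments(pattern, text):
    """Find all members of the class in text (normalized the same way)."""
    return pattern.findall(unicodedata.normalize("NFD", text))


# Hypothetical inventory: nasal vs. oral vowels.
NASAL_VOWELS = natural_class(["ã", "ẽ", "õ"])
ORAL_VOWELS = natural_class(["a", "e", "o"])

nasal_hits = find_segments(NASAL_VOWELS, "kãto")
oral_hits = find_segments(ORAL_VOWELS, "kãto")
```

Here the class is defined by the phonology (whatever segments I put in the list), not by a code point range like [A-Z], which is what I was after.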

RegEx+PERL+Unicode:
http://perldoc.perl.org/perlretut.html

PERL:
http://www.enginsite.com/Library-Perl-Regular-Expressions-Tutorial.htm
http://www.cgi101.com/book/connect/mac.html
http://www.mactech.com/articles/mactech/Vol.18/18.09/PerlforMacOSX/index.html

Python:
http://www.amk.ca/python/howto/regex/

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)
