This post is an open draft! It was originally started on April 23, 2011. Almost two years later it makes its public debut. It might be updated at any time, but it was last updated on December 18, 2013 at 8:41 pm.
Meta-data is not just for Archives
Bringing the usefulness of meta-data to the language project workflow
It has recently come to my attention that finding a network-accessible file management solution is a real challenge during a language documentation project. This realization came with my first linguistic field experience and my first field setting for a language documentation project.
The project I was involved with was documenting four languages in the same language family. The location was in Mexico. We had high-speed Internet and a Local Area Network, and stable electricity (more often than not). The heart of the language communities was a 2-3 hour drive from where we were staying, so we could make trips to different villages in the language community, and there were language consultants coming to us from various villages. Those consultants who came to us were computer literate and were capable of writing in their language. The methods of the documentation project were motivated along the lines of: “we want to know ‘xyz’ so we can write a paper about ‘xyz’, so let’s elicit things about ‘xyz’”. In a sense, the project was product oriented rather than (anthropological) framework oriented.
We had a recording booth. Our consultants could log into a Google Doc and fill out a paradigm; we could run the list of words given to us through the Google Doc into a word processor and create a list to be recorded, give that list to the recording technician, and then produce a recorded list. Our consultants could also create a story, and often did; we would then help them to revise it and record it. We had geo-social data from the Mexican government census. We had geo-spatial data from our own GPS units. During the course of the project massive amounts of data were created in a wide variety of formats. Additionally, in the case of this project, language description was happening concurrently with language documentation. The result is that additional data is desired and generated.
That is, language documentation and language description feed each other in a symbiotic relationship. Description helps us understand why this language is so important to document and which data to get; documenting it gives us the data for doing analysis to describe the language. The challenge has been how to organize the data in meaningful and useful ways for current work and future work (archiving). (People are evidently doing it, all over the world; maybe I just need to know how they are doing it.) In our project there were two opposing needs for the data:
- Data organization for archiving.
- Data organization for current use in analysis and evaluation of what else to document. (It could be argued that a well-planned corpus would eliminate, or reduce, the need for flexibility in deciding what else there is to document. This line of thought does have its merits. But flexibility is needed by those people who do not try to implement detailed plans.)
Notice that this is not the difference between the archive form of the data and the presentational form of the data. This is the difference between the working form of the data and the archival form of the data. (In the project I worked on there was a lot of thinking along the lines of: “What do I want to present on this language? That is what I need to archive, so let’s collect that.”)
During the course of the project it became clear that tracking meta-data was an extra burden (overhead) rather than something which actually added new insight to the analysis, or even energy to the project. This is partially because the data (even the files containing the data) was not always organized in a helpful way. Another reason was that files could not be sorted and arranged based on meta-data values associated with the files.
One aspect of meta-data collection was renaming the files in appropriate ways so that the right files grouped together. This took some thinking through and the development of a file naming convention. To assist in the task of file renaming several applications were reviewed. Of these applications, A Better Finder Rename was determined to offer the best price for the options it provided. This was followed by Name Mangler, and lastly the one with the fewest features was Batch File Rename. Not tested were Renamer (OS X) and Bulk Rename Utility (Windows only). A Better Finder Rename is available for both Windows and OS X.
“Helpful” in this context should be explored a bit further than just acknowledged on the surface level. There were several schools of thought among the participants, and there were different kinds of objectives for each of the participants. This was reflected in the way we worked and what we worked on. Our needs (so things can be “helpful”) also depended on whether our approaches were event oriented or product oriented. (“I am doing a paper on place names so I need to collect place names” is an example of product oriented. This is contrasted with “I am recording a story and a place name comes up” as an example of event oriented.) Organizing all the recordings containing place names together would be one way to organize the files. Another way would be to mark each event which contains a place name but to keep all the files related to one recording event together, i.e. the audio file, the video file, the transcription, the translation, the geo coordinates relevant to the recording, etc. The next logical question, if the organization by event option is selected, is how to organize the events.
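As a minimal sketch of the "organize by event" option discussed above: assuming a hypothetical naming convention of `<language>_<date>_<event>_<kind>.<ext>` (the convention, the field order and the sample file names are illustrations only, not the convention the project actually used), files from one recording event can be clumped together by parsing their names.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical convention (an assumption for illustration only):
#   <language>_<YYYYMMDD>_<event>_<kind>.<ext>
def event_key(filename):
    """Return (language, date, event) for a name following the convention."""
    stem = PurePath(filename).stem
    language, date, event, _kind = stem.split("_", 3)
    return (language, date, event)

def group_by_event(filenames):
    """Clump all files that belong to the same recording event."""
    groups = defaultdict(list)
    for name in filenames:
        groups[event_key(name)].append(name)
    return {key: sorted(names) for key, names in groups.items()}

# Invented sample names; "xyz" stands in for a language code.
sample_files = [
    "xyz_20110423_story01_audio.wav",
    "xyz_20110423_story01_video.mp4",
    "xyz_20110423_story01_transcription.txt",
    "xyz_20110424_wordlist02_audio.wav",
]
events = group_by_event(sample_files)
for key, names in sorted(events.items()):
    print(key, names)
```

A name that does not follow the convention raises a `ValueError` here, which is probably what one wants while the convention is still being settled.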
A quick look at the methods used to keep track of the data:
- Server with folders and files.
- Excel spreadsheet with meta-data values which would be ingested by an archive.
With a workflow influenced by a language documentation view of a project, it follows that files need to be organized so that they can be submitted to the archive. However, this forces the organizers of the files into a paradigm of viewing their files (this paradigm might be provided by the archive, or it might be a system of organization created just to survive the onslaught of data during the project). Yet the data and meta-data are not without relationships: relationships to each other, to the participants, to the local geography, to other languages, to project plans and objectives, to linguistic theory, to community interests (just to name a few). A preformed paradigm of organization will not respect all of these relationships. Each of these relationships shows the research team something useful, which during analysis can affect the documentation project. The interface to the data needs to help the project participants view and connect with the data in relevant, intuitive, informative, collaborative, progressive, time-efficient and relational ways. In a way this is a data visualization issue.
There are many kinds of data visualization techniques and tools. Data visualizations are really important to the way we think and understand things in a time-space continuum (item “a” was in location “x” at time “y”, while item “b” was in location “w” at time “y”). Visualizations are also really good for communicating things which are removed from our local cognitive environment (like the massive interrelatedness of the internet) and abstract concepts like workflows and conceptual categories (put the data in this bucket or that bucket). They are also good at demonstrating the relationship between two (or more) concepts. (These are sometimes called mash-ups.) The following is a map of the earthquake and aftershocks in Japan following the big earthquake in 2011. This shows the relationship between locations and magnitude of the earthquakes over time.
This is a visualization of the relatedness of several bands I like. Thanks to http://audiomap.tuneglue.net/
This is a visualization of the xhtml tags in Becky’s Blog. Available through http://www.aharef.info/static/htmlgraph/. What do the colors mean?
- blue: for links (the A tag)
- red: for tables (TABLE, TR and TD tags)
- green: for the DIV tag
- violet: for images (the IMG tag)
- yellow: for forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
- orange: for linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
- black: the HTML tag, the root node
- gray: all other tags
Another visualization tool is Weeplaces.
Displaying the complexities of the relationships in the meta-data and the data is something that the file systems of modern operating systems (OS X, Windows, Linux) cannot deliver. Something else that the file system cannot deliver is organization by “custom” meta-data attributes. Yet one more limitation of the file system is that it does not allow for the re-use of meta-data (implied meta-data based on associations of files). I cannot drop a file in a folder and then know that the speaker on that file was “xyz” because the other two files in the folder both had the speaker “xyz”.
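To make the idea of implied meta-data concrete, here is a small sketch, assuming each file already in a folder has a simple dictionary of meta-data (the field names and values are invented for illustration). It suggests, for a newly dropped file, every value on which all the existing siblings agree.

```python
def suggest_metadata(sibling_records):
    """Suggest values for a new file: keep each field on which every
    already-described sibling in the same folder agrees."""
    if not sibling_records:
        return {}
    suggestions = dict(sibling_records[0])
    for record in sibling_records[1:]:
        for field in list(suggestions):
            if record.get(field) != suggestions[field]:
                del suggestions[field]
    return suggestions

# Invented records for the two files already sitting in the folder.
folder = [
    {"speaker": "xyz", "location": "village A", "kind": "audio"},
    {"speaker": "xyz", "location": "village A", "kind": "transcription"},
]
suggested = suggest_metadata(folder)
print(suggested)
```

These are only suggestions; a researcher would still need to confirm them, since agreement among siblings does not prove anything about the new file.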
As an alternative to organizing a project from a Language Documentation perspective, one could organize the project topically according to descriptive interest or explorative process. That is, one might choose a topic of interest and organize the files and name them according to how they relate to that topic, e.g. grammar, nominals, noun-marking, di-transitives. It is not that the points of topical interest are any less valuable to the language documentation project, but that the organizational taxonomies are different. Due to the limitations of today’s (2011) OSes only one hierarchical structure can be used at a time. It is unfortunate that, when looking at larger projects and packaging content for archives, these methods of arranging data (with their necessary organizational structures) do not match the way that many researchers would prefer to organize their data so that it is useful to them while they work.
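One way around the single-hierarchy limitation is to leave the files where they are and build several indexes over their meta-data, so that the same file shows up under a topical view and an event view at once. The following is only a sketch under that assumption; the field names and records are invented.

```python
from collections import defaultdict

def build_index(records, field):
    """Index file names by every value of one meta-data field."""
    index = defaultdict(set)
    for name, meta in records.items():
        for value in meta.get(field, []):
            index[value].add(name)
    return index

# Invented records: each file carries values for more than one taxonomy.
records = {
    "story01.wav": {"topic": ["nominals", "noun-marking"], "event": ["story01"]},
    "wordlist02.wav": {"topic": ["nominals"], "event": ["wordlist02"]},
    "story01.txt": {"topic": ["di-transitives"], "event": ["story01"]},
}
by_topic = build_index(records, "topic")
by_event = build_index(records, "event")
print(sorted(by_topic["nominals"]))
print(sorted(by_event["story01"]))
```

Each index is just another view; none of them requires moving or renaming a file.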
I have also recently been introduced to Drupal. Although Drupal is designed to be a web-based CMS, it is entirely possible to run it on a laptop for personal use (much like what is recommended for a developer's environment). As of this writing the most current release of Drupal is 7.x; however, more modules which might prove useful for this task are currently available for Drupal 6.x than for 7.x, so this experiment will start out with version 6.x.
Defining the target set of users
Initially the target set of users is the language researchers on a language documentation project. It is assumed that investigators in a language documentation project will be M.A. students or above, and will have a proficient command of computing technologies as computer users (obviously there will be some variation in proficiency). However, a solution which enables a distributed workforce to collaborate on a single project could be used by anyone with a computer and a browser. This means that potentially language consultants could use it as well. The capacity and the functions in which the various participants use the tool could conceivably change, or at least vary.
Must Haves – Features
There are many cool things which can be done with Drupal. In order to prevent project creep, it is important to clearly identify the needs of the project. Several important goals were considered before even settling on Drupal as a CMS or base solution.
- The ability to implement a solution across Windows, OS X and Linux platforms.
- The ability to use the same solution on a personal computer and on a server on a local network, or a Wide Area Network (web hosted).
- The solution, if browser based, had to be compliant with IE, Firefox and Safari (CSS2 minimally, even CSS3).
- The ability to use customizable, off the shelf, parts to create a complete solution.
- The ability to keep the final product at an acceptable cost to end users.
- The ability to leverage good UI design on the final product to enable end users to quickly adapt to concepts presented by the work and to intuitively know how to use the tool.
- The install process of the final product needed to be simple enough to allow for widespread adoption with a low impact on needed support for customizations.
- The ability to share information among project users (researchers and contributors, language consultants) in a collaborative environment.
- The ability to share information with the greater linguistic community via established norms.
In addition to these requirements there are certain technologies which are prevalent in the field of Language Documentation. These range in scope from file types, APIs and data sets to research frameworks. Also to be taken into account are other technological tools used in a language documentation project, e.g. FLEx, ELAN, Toolbox, and tools from the Max Planck Institute. Methodologies from both Language Documentation and Language Description (Descriptive Linguistics) need to be accounted for, as projects, depending on their purpose, staffing and desired impacts, are likely to use methodologies from both sub-fields of Linguistics. These research requirements also needed to be accounted for in a systematic way. The method used to account for these was to walk through a Language Documentation project to experience first hand the work-place data handling requirements.
In a way there are three levels of requirements:
- there are organizational and processing requirements, like being able to extract meta-data from and write meta-data to file types, or to view files on a map related to the location they represent or the type of document they are
- there are functional requirements, i.e. UX requirements like drag-and-drop file additions
- there are design requirements, i.e. UI requirements like clear lines between different sections of the CMS so that a user can know where they are and navigate to where they need to go, or a visible difference between being logged in as a user/researcher and as a site god or admin
These requirements relate loosely to three areas of the process as they are experienced by the end user:
- What the user wants to get done.
- How the user gets it done.
- How it is organized for the user to get it done.
Ultimately what we are looking for is a platform to facilitate the collaboration around a language documentation (research) project. (As I was beginning to write this and look into Drupal as a solution I ran across an experiment in trying to get Drupal to be a virtual research environment.) A large part of this collaboration revolves around appropriate organization; perhaps the rest of it revolves around access to actionable information: having the right information collected together and processable so that the right decisions can be made.
So, in pursuing what kind of platform can provide this functionality and what that platform might look like, let us look at the first element in design as presented above: what the user wants to get done. These “tasks” are presented in the numbered points below.
- The Plan
Most research projects start with a plan. In fact, most projects accomplish something, but only successful projects have stated what they wanted to achieve and acknowledge achieving it. It can be stated like this:
It has been said that he that fails to plan, plans to fail. However, I wonder, if one plans to fail and really does it, does he succeed?
Reference Gary Simons’ paper (which one? the one about planning a documentation corpus?) and the things to consider.
This module needs to be able to help users define deliverables (goals), tools used, and locations, costs, etc. It needs to be able to help convey these goals to the team members and the language consultants. As files are created and held in this content management solution it needs to be able to relate files to multiple deliverables.
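Relating one file to several deliverables is a many-to-many relationship. The sketch below shows one way to hold it, assuming nothing about how a Drupal module would actually store it; the class, the method names and the sample deliverables are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Hypothetical project plan: deliverables plus file-to-deliverable links."""
    deliverables: dict = field(default_factory=dict)  # name -> description
    links: set = field(default_factory=set)           # (file name, deliverable)

    def add_deliverable(self, name, description):
        self.deliverables[name] = description

    def relate(self, filename, deliverable):
        """Link a file to a deliverable that is already part of the plan."""
        if deliverable not in self.deliverables:
            raise KeyError(f"unknown deliverable: {deliverable}")
        self.links.add((filename, deliverable))

    def deliverables_for(self, filename):
        return sorted(d for f, d in self.links if f == filename)

    def files_for(self, deliverable):
        return sorted(f for f, d in self.links if d == deliverable)

plan = Plan()
plan.add_deliverable("place-names paper", "Paper on place names")
plan.add_deliverable("archive deposit", "Recordings packaged for the archive")
plan.relate("story01.wav", "place-names paper")
plan.relate("story01.wav", "archive deposit")
print(plan.deliverables_for("story01.wav"))
```

Because the links are stored separately from both the files and the deliverables, a recording made for one purpose can later be attached to another without being copied.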
Action Item: GET A MODULE FOR PROJECT PLANNING
Cross comparison of modules supporting task management.
Another option is to use Open Atrium.
- The Background Research
The plan is usually enhanced and refined with background research or Library work. Researchers needed a place to share citations, and the content that the citations reference (usually PDFs). They needed to be able to share the citation data so that various authors on documents related to the project had a single source of citations they were referencing. They also needed a way to talk to each other about how a resource is used, relevant to the project or otherwise of interest.
Action Item: GET A MODULE WHICH HOLDS THE RESEARCH FILES COLLECTED AND ‘HOLDS, DISPLAYS AND SORTS’ BY THE COLLECTED METADATA’S RESOURCES I.E. THE CITATION DATA.
There is one drawback which needs to be addressed in this module: there is no way to add another data type to the Drupal database.
In addition to this, it is important to know what other resources on this subject are out there. There are three kinds of resources which quickly come to mind:
- OLAC resources from search.language-archives.org
- Wikipedia resources on the language or linguistic topic
- Linguist List resources like llmap.org
For each of these:
- Pull in known resources from OLAC on a given language: http://drupal.org/project/oai_pmh Have it recheck every 10 days, or as often as the user sets. I am not sure if there is even an API for querying items aggregated by OLAC. To assist with this we might need to look at using YQL (Yahoo Query Language) and a Drupal module: http://drupal.org/project/yql_views_query.
- It would also be good to sync data with Wikipedia.
- Resources from Linguist List: I have no clue how to search for these, but there must be a way to aggregate the results of a search on their site and then mark a particular result as not of interest.
These items need to be pulled into a queue where the research team can decide if they are relevant or not. If they are relevant then they become part of the research body of literature which the team is collecting.
If they are not relevant then they are removed from the queue and not shown again.
Additionally, as items from this project might get introduced to the aggregation (especially via OLAC) it would be important to never show items which are sourced from this project.
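The queue behaviour described above (accept, reject and never show again, never show our own items) can be sketched as follows. This is an illustration only: the record fields `id` and `source`, the class name and the sample records are assumptions, and no harvesting from OLAC or any other service is attempted here.

```python
class ReviewQueue:
    """Queue of harvested records awaiting a relevant / not relevant decision."""

    def __init__(self, own_source):
        self.own_source = own_source  # label marking items that came from this project
        self.pending = {}             # identifier -> record awaiting review
        self.accepted = {}            # the team's body of literature
        self.rejected = set()         # identifiers never to show again

    def offer(self, record):
        """Queue a harvested record unless it is ours or already decided."""
        ident = record["id"]
        if record.get("source") == self.own_source:
            return False
        if ident in self.rejected or ident in self.accepted:
            return False
        self.pending[ident] = record
        return True

    def decide(self, ident, relevant):
        """Move a pending record to the accepted set or the rejected set."""
        record = self.pending.pop(ident)
        if relevant:
            self.accepted[ident] = record
        else:
            self.rejected.add(ident)

# Invented records standing in for harvested results.
queue = ReviewQueue(own_source="this-project")
grammar = {"id": "olac:1", "source": "olac", "title": "Grammar sketch"}
ours = {"id": "local:7", "source": "this-project", "title": "Our word list"}
queue.offer(grammar)
queue.offer(ours)                      # silently skipped: it is our own item
queue.decide("olac:1", relevant=False)
print(queue.offer(grammar))            # a later harvest offers it again
```

Keeping the rejected identifiers, rather than just deleting the records, is what lets a periodic re-harvest skip items the team has already dismissed.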
- It would be good to look into ingesting content from RSS Feeds from OLAC.
- Pass Known created resources to OLAC: http://drupal.org/project/oai2
At the heart of a collaborative research endeavor is a file management system which makes the right information clump together. Even as we create the plan and revise the plan with relevant background research we find that there is a need to organize the resources we have collected, and then to be able to use the resources and collaborate concerning their implications for the research project. The right clumping of collected resources and creations enhances analysis and communication, and ultimately helps to achieve the goals set out in the plan.
The Data Types and File Types for every project will depend on the tools used and the deliverables. Data Types and File Types are different and need to be treated differently. For instance, Shape Files [.shp] are related to cartography and map-making, while GPS eXchange Format [.gpx] files are XML files of GPS tracks. Both can be used in map making and both can enhance other files, but they are different file types (by virtue of their extensions) even though they are generally lumped together as being the same data type, i.e. neither contains audio or image data.
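The distinction between file type and data type can be sketched as a simple lookup: the file type comes from the extension, and several file types map onto one data type. The grouping labels below ("geodata", "audio", and so on) are my own assumption for illustration, not an established classification.

```python
from pathlib import PurePath

# Assumed grouping of file types (extensions) into data types.
DATA_TYPE_BY_EXTENSION = {
    ".shp": "geodata", ".gpx": "geodata",
    ".mp3": "audio", ".wav": "audio",
    ".doc": "text", ".docx": "text", ".txt": "text", ".odt": "text",
    ".ris": "citation", ".enw": "citation",
}

def file_type(filename):
    """The file type is read off the extension."""
    return PurePath(filename).suffix.lower()

def data_type(filename):
    """The data type is the broader category the file type belongs to."""
    return DATA_TYPE_BY_EXTENSION.get(file_type(filename), "unknown")

for name in ["villages.shp", "tracks.GPX", "story01.wav"]:
    print(name, file_type(name), data_type(name))
```

Ambiguous extensions such as [.xml] or [.pdf] are left out on purpose: the extension alone does not say whether the content is a paper, a citation list or scanned notes.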
Keeping in mind the three things mentioned above (form, function and organization, from the users’ perspective) brings us to a point where we can discuss the file types and the data types of the project and how they need to interrelate. That is, what are the elements of an effective and efficient file management strategy?
INSERT SOME PARAGRAPH ABOUT THE RE-USABILITY OF META-DATA and how that contributes to research and so is also research
We use Geodata:
Spreadsheet: [.xls], [.xlsx]
Shape files: [.shp], group of 4 files.
GPS eXchange Format: [.gpx]
We use Audio Data:
MPEG-1 or MPEG-2 Audio Layer III [.mp3]
Broadcast WAV [.wav]
We use Paper and Pens:
These we scan to PDFs. I assume that they are TIFFs in a PDF wrapper, but for the most part we don’t really care about the archive quality of these documents. They are often just notes. However, sometimes they are data from consultants. [.pdf]
We use Word Processors:
We use word processors to handle data and analysis.
[.doc] [.docx] [.txt] XLingPaper [.xml] [.pdf] OpenOffice file format [.odt]
We use Citations:
RIS [.ris] Endnote [.enw] XlingPaper [.xml]
We use Cloud based solutions:
Google Docs (mostly the spreadsheets)
We use Linguistic Data tools:
FLEx, Praat, ELAN, Audacity [.aup]
We track meta-data:
Spreadsheet [.xls] [.xlsx] [.csv]
We use Data from Prior Related Projects:
Spreadsheet [.xls] [.xlsx] [.doc] [.xml]
We use Databases like Drupal!
We also have Task Types: Elicitation in writing, oral elicitation, Video events, analysis, publication, Sharing, collaboration, note taking, publicity, engagement of the community, collect, annotate, archive, produce community products, Implementation of the Plan.
Enter the Research framework: which says, if we are going to do ‘xyz’ then we need to do ‘abc’ in this method and consider these ‘def’ variables and values.
The place of the Data Model. How things are related from a research perspective.
Things that the file management needs to accomplish:
- Rename the files.
- Read the file’s metadata.
- Collect the metadata about the files from the researcher, as is established in the plan.
- Suggest related metadata which is already in the system, based on input in the upload process.
- Present relevant metadata according to given options.
- Write the files to the file location specified by the meta-data.
- Embed the metadata directly in the files where possible.
- Present the user with options on how to find the files.
- Find files (search) by recording location, by text type, etc.
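Two of the tasks above, renaming and placing a file according to its meta-data and then finding files by meta-data values, can be sketched together. The path pattern, the field names and the sample values are assumptions for illustration; this does not reproduce how any particular Drupal module does it.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: folders and file name are filled in from meta-data.
PATTERN = "{language}/{location}/{text_type}/{language}_{date}_{speaker}{ext}"

def slug(value):
    """Lower-case a value and replace runs of unsafe characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", str(value).lower()).strip("-")

def target_path(original_name, metadata, pattern=PATTERN):
    """Build the new location and name of a file from its meta-data."""
    ext = PurePosixPath(original_name).suffix.lower()
    fields = {key: slug(value) for key, value in metadata.items()}
    return PurePosixPath(pattern.format(ext=ext, **fields))

def find(catalog, **criteria):
    """Return the names of files whose meta-data matches every criterion."""
    return sorted(name for name, meta in catalog.items()
                  if all(meta.get(k) == v for k, v in criteria.items()))

# Invented meta-data for a file straight off a recorder.
metadata = {"language": "xyz", "location": "Village A", "text_type": "Story",
            "date": "2011-04-23", "speaker": "Speaker 1"}
new_path = target_path("ZOOM0001.WAV", metadata)
print(new_path)

catalog = {str(new_path): metadata,
           "xyz/village-b/wordlist/xyz_2011-04-24_speaker-2.wav":
               {"language": "xyz", "location": "Village B", "text_type": "Wordlist"}}
print(find(catalog, location="Village A"))
```

The search works on the meta-data catalog rather than on the folder names, so the folder layout can change without breaking the ways of finding a file.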
Audio files http://drupal.org/project/audio_filefield
Video Files http://drupal.org/node/555174
Drag and Drop Upload: http://drupal.org/project/dragndrop_uploads
File Manager: http://drupal.org/project/elfinder
UI for file management includes
Full text (look through PDFs)
How it got that way why it was that way
DESCRIBE THE RESEARCH/OFFICE ENVIRONMENT
collaboration, communication, goals, experience,
Why it is like that?
Is it what we want?
What do we want?
Post footnotes: http://drupal.org/project/footnotes
Working with PDFs:
Shape files in drupal
GEO What ever integration into drupal
Consume .GPX files
how to cite it: http://www.ihsenergy.com/epsg/guid4.html
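For the "Consume .GPX files" note above, here is a minimal sketch of reading track points with Python's standard library rather than with a Drupal module. It assumes GPX 1.1 (GPX 1.0 uses a different namespace), and the sample coordinates are made up.

```python
import xml.etree.ElementTree as ET

# Namespace of GPX 1.1 documents.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def track_points(gpx_text):
    """Return (latitude, longitude, time or None) for every track point."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", GPX_NS):
        time = trkpt.findtext("gpx:time", default=None, namespaces=GPX_NS)
        points.append((float(trkpt.get("lat")), float(trkpt.get("lon")), time))
    return points

# Invented sample track with two points.
SAMPLE_GPX = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk><trkseg>
    <trkpt lat="17.0" lon="-96.5"><time>2011-04-23T15:00:00Z</time></trkpt>
    <trkpt lat="17.25" lon="-96.75"/>
  </trkseg></trk>
</gpx>"""

points = track_points(SAMPLE_GPX)
print(points)
```

The coordinates and times extracted this way are the kind of values that could be attached to a recording event as its location meta-data.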
Modules currently used:
|Biblio||Maintains bibliographic lists.|
|Biblio – PubMed||Adds PubMed import and search to the Biblio module.|
|Bitcache||Provides a distributed, content-addressable repository for data storage.|
|Content||Allows administrators to define new content types.|
|Content Copy||Enables ability to import/export field definitions.|
|Content Multigroup||Combine multiple CCK fields into repeating field collections that work in unison.|
|Content Permissions||Set field-level permissions for CCK fields.|
|Fieldgroup||Create display groups for CCK fields.|
|Node Reference||Defines a field type for referencing one node from another.|
|Number||Defines numeric field types.|
|Option Widgets||Defines selection, check box and radio button widgets for text and numeric fields.|
|Text||Defines simple text field types.|
|User Reference||Defines a field type for referencing a user from a node.|
|DAV file system||Provides WebDAV access to an administrator-specified file system directory on the server.|
|Drag'n'Drop Uploads||Adds Drag’n’Drop file upload functionality.|
|Drupal tweaks||Drupal tweaks module for additional functionality|
|Database tweaks||Database tweaks module for additional MySQL settings|
|form2block||Convert any form or node into block|
|Inline Msg||Move validation messages above the form elements.|
|External RDF Vocabulary Importer||Allows to import external Vocabularies in order to map them to Drupal data objects.|
|Evoc Reference||Defines a field type for referencing an RDF class or property from a node.|
|Author Facet||Provides a facet for content authors.|
|Content Type Facet||Provides a facet for content types.|
|Date Authored Facet||Provides a facet for searching content by date of creation.|
|Date Facets Format||Provides formatting options for date-based facets.|
|Faceted Search||API for performing faceted searches.|
|Faceted Search UI||User interface for searching and browsing through multiple facets.|
|Faceted Search Views||Allows to use Views to display Faceted Search results.|
|CCK Field Indexer||Allows the indexing of CCK fields into the search index.|
|Field Indexer||Provides a configuration page and an API for indexing fields into the search index.|
|Node Field Indexer||Allows the indexing of core node data into the search index.|
|FileField Paths||Adds improved Token based file sorting and renaming functionalities.|
|Attachments||Allows attaching files to nodes and comments.|
|Browser||Provides a file browser for file nodes organized in a hierarchical taxonomy tree.|
|CCK||Integrates file operations with the CCK module.|
|Converters||Allows files to be converted from one MIME content type into another as needed.|
|Embed||Provides an input filter for embedding file into other content.|
|Gallery||Provides a taxonomy-based gallery view of various file types.|
|Restrictions||Controls which restrictions should be applied on the uploaded files.|
|Views||Integrates file operations with the Views module.|
|File||Allows uploading files as a standalone content type and provides a comprehensive file management framework.|
|Audio files||Supports audio file formats.|
|Documents||Supports document file formats.|
|Images||Supports image file formats.|
|Spreadsheets||Supports spreadsheet file formats.|
|Texts||Supports text file formats.|
|File relations server||Provides WebDAV access to nodes with attached and related files by group, group taxonomy, node type, and file type.|
|File Import||Allows batches of files to be imported from a directory on the server.|
|Format Number API||This module provides a method to configure number formats (site default and user defined) with configurable decimal point and thousand separators. It also exposes several functions that can be used by other contributed or custom modules to display numbers accordingly.|
|Geo||Provides a storage engine and API for Geospatial data.|
|Geo Data||Interface with external GIS data, providing views and field integration|
|Geo Field||Provides a CCK field for Geospatial data.|
|Geo UI||An interface for working with geospatial data.|
|Importer||Module for importing external data into the database with some additional features (see README)|
|Location CCK||Defines a Location field type.|
|Location||The location module allows you to associate a geographic location with content and users. Users can do proximity searches by postal code. This is useful for organizing communities that have a geographic presence.|
|OAI2||This module provides Open Archives 2 protocol access to the information stored by the Biblio module|
|Organic groups||Enable users to create and manage groups. OG Views integration module is recommended for best experience.|
|QueryPath for Arrays||A QueryPath implementation that operates on arrays.|
|QPCache||An XML caching layer for QueryPath or other tools.|
|QueryPath||QueryPath is a developer tool for working with HTML and XML.|
|QueryPath Examples||View and run some examples of QueryPath.|
|RDF||Enables the use of Resource Description Framework (RDF) metadata.|
|RDF CCK||Allows to define mappings between CCK fields and RDF and export nodes as RDF.|
|Relations API||Provides an API for arbitrary node relationships based on RDF.|
|Schema||The Schema module provides functionality built on the Schema API.|
|Sheetnode Google Spreadsheets||Provides Google Spreadsheets import/export support for Sheetnode.|
|Sheetnode HTML||Provides HTML TABLE import/export support for Sheetnode.|
|Sheetnode PHPExcel||Provides PHPExcel integration with Sheetnode.|
|Sheetnode Text||Provides text table import/export support for Sheetnode.|
|Sheetnode||Spreadsheet node using SocialCalc.|
|SPARQL API||Enables the use of SPARQL queries with the RDF API.|
|Taxonomy CSV import/export||Export and import complete taxonomies, hierarchical structure or simple lists of terms and properties to or from a CSV file, url or text.|
|Token||Provides a shared API for replacement of textual placeholders with actual data.|
|TokenSTARTER||Provides additional tokens and a base on which to build your own tokens.|
|Token actions||Provides enhanced versions of core Drupal actions using the Token module.|
|Table Wizard||The Table Wizard provides services for managing database tables|
|Table Wizard Import Delimited Files||Provides default hooks for importing delimited files into a Drupal database|
|Upload path||Organize uploaded files according to admin-specified rules.|
|Upload preview||Adds image preview thumbnails to the file attachment section.|
|Used modules||Displays as a table, within a block or a page, all the modules installed on a Drupal site.|
|Views||Create customized lists and queries from your database.|
|Views exporter||Allows exporting multiple views at once.|
|Views UI||Administrative interface to views. Without this module, you cannot create or edit your views.|
|Bonus: Views Export||Plugin to export views a couple of formats including Comma Separated Values(CSV), Doc or XML|
|Views data export migration||Provides helper functions and UI for converting a views_bonus_export display to a views_data_export display|
|Views Data Export||Plugin to export views data into various file formats|
|Webform||Enables the creation of forms and questionnaires.|