Engaging Archives

In my Lexical Database Archiving Questionnaire I begin to talk about engaging archives to find out which archives have lexical databases and which languages those databases describe. In the initial stages of this discussion, several questions have surfaced that deserve an answer. I will do my best to record these thoughts here (please note that this page will be updated periodically).

Why not look at OLAC?

Date added: 12. December 2013
Short answer: We do and have.

This is a great question. I have searched OLAC records and continue to do so. I will be posting more about the role OLAC records have had in this project.

So far I can say:

  1. Not every organization (archive) submits the same kind of records (project records vs. item records).
  2. When institutions do provide multiple records for the same item (describing different scopes of the object), these records are not always clearly linked.
  3. Not all records are tagged the same way within the same organization.
  4. Not all institutions, nor all records from some institutions, have the minimum valid metadata for the OLAC 1.1 standard.
  5. Some organizations, by policy, prevent this level (Lexical Database) of metadata from being distributed (SIL).
  6. Sometimes "Toolbox files", when mentioned in records, refer to text markup rather than lexical datasets.
  7. Not all OLAC data providers provide end pages for the items they mention (Hawai'i, TLA, U-de, etc.).

What about privacy? When asked to divulge their records, archives can sometimes provide only general information, not the level of detail you are currently asking for.

Date added: 12. December 2013
Short answer: If it is too sensitive to let the world know it exists, then don't share it with us.

This is another great point, and it deserves a detailed answer, so I have put it on its own page: The Taxonomy of Archivable Resources and their publicity levels. There I suggest a simple taxonomy of archive resources. According to this taxonomy, we are interested in anything that would fit in the Grey, Yellow, and Green boxes. We are not interested in anything that would fit in the Red box.

What is 'an Archive'?

Date added: 14. December 2013

This question is asked more than I initially thought it would be. It deserves an answer.

Engaging People

The following questions are often asked by individuals but pertain to institutions as well.

What is a lexical Resource?

Date added: 14. December 2013
Short answer: It varies.

This is an important question. I don't have all the answers. As I get them, I will modify my page here, which discusses this issue.

What is the difference between archiving and back-up?

Date added: 13. December 2013
Short answer: Back-up protects us from data loss, whereas archiving inducts the data into practices of data preservation, formal description, systematic access, and data protection.

Archiving and back-up are different.
First, let's explain what we mean by "Copy" and "Back-up", and then we will contrast this with what we mean by "Archive". In the IT industry there are generally three kinds of back-up (explained below). Additionally, we often see the word "archive" associated with IT products, e.g. Google Mail (Gmail) has a feature called "archive", as do many IMAP email systems. Amazon Cloud Storage promotes some of its data storage products as being "good for 'archiving'". We take issue with how the term is used in these contexts and clarify what we mean by archiving below.

'Copy' and 'Back-up'

  1. Same Drive - Onsite :: We call this a Copy. If your computer is stolen or the drive goes bad, both "copies" are lost. This is not a back-up.
  2. Separate Drive - Onsite :: A copy of the data lives on a second drive, but that drive is in the same location as the computer where the file exists. If one of the two drives dies, the data is recoverable from the other drive. However, because both drives are in the same location, a catastrophic event such as fire, explosion (war), or theft would very likely render both devices unusable. An example of this kind of back-up solution for OS X is Time Machine.
  3. Separate Drive - Offsite :: This kind of back-up solution may come in two varieties: (1) Same Region or (2) Different Region.
    • Same Region works like this: The person backing things up makes a back-up copy on a drive and passes the drive to a custodian, who stores the drive (and the data entrusted to them) in a secondary location in some other part of the city or small country. For this to work well, it often must be done at regular intervals. In one language documentation project Hugh Paterson was involved in, he used Time Machine and replaced the data on the offsite disk weekly. Note: many linguistic field projects, including SIL entities around the world, have solutions at this level.
    • Different Region works like this: The person backing things up (usually) uses a dynamic service which stores a copy of their hard drive in one or more data centers. There are several commercial services which work like this, but all function slightly differently. For example: CrashPlanPROe, Carbonite, or RebuSync (though RebuSync as of 2010 did not keep versions of files, and was difficult to use with the large file sizes of primary data in a language documentation project). At this level of back-up solution, if a data center in a hurricane zone like Florida were destroyed, the data would be expected to also be stored somewhere else, like Los Angeles or Tokyo. The data could then be restored or accessed from this second location.

    Note: What about Google Drive, SugarSync, or DropBox? Aren't these back-up solutions? No. We would not classify these as back-up solutions; we classify them as collaboration and file-sharing solutions. Here is why: these kinds of solutions (typically) sync materials from the user's computer to the user's cloud account. If a user accidentally deletes content from their computer, the deletion is also replicated to the remote file store, thereby deleting the file in the offsite location as well. (We recognize that some services do offer versioning, which gives users a limited capability to recover deleted files, but these features are not automatic and usually come bundled with premium versions of these products/services.)
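The difference between syncing and backing up can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: files are modelled as entries in a dictionary, and the functions mirror_sync and snapshot_backup are hypothetical names invented for this sketch, not the API of any real service.

```python
# Hypothetical in-memory model; no real sync or back-up service is called.

def mirror_sync(local, remote):
    """Make `remote` an exact mirror of `local`, deletions included."""
    remote.clear()
    remote.update(local)


def snapshot_backup(local, snapshots):
    """Store an independent copy of `local`; older snapshots are never changed."""
    snapshots.append(dict(local))


local = {"lexicon.db": "v1"}   # the user's working files
synced = {}                    # what a sync-style service holds remotely
snapshots = []                 # what a versioned back-up holds

mirror_sync(local, synced)
snapshot_backup(local, snapshots)

# The user accidentally deletes the file, and then both tools run again.
del local["lexicon.db"]
mirror_sync(local, synced)
snapshot_backup(local, snapshots)

print("lexicon.db" in synced)        # the deletion was replicated
print("lexicon.db" in snapshots[0])  # the first snapshot still has the file
```

In this toy model the mirrored copy loses the file as soon as the deletion is synced, while the earlier snapshot still holds it and could be used for a restore.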

So then what does 'Archived' mean, and how is it different from 'Back-up'? Whereas back-up is primarily concerned with preventing data loss, archiving is concerned with preserving the usability of the data, its discoverability, its provenance (history) and identification, and access to the data.

If my Data is in the cloud does that mean it is archived?

Date added: 13. December 2013
Short answer: No.

This is also a great question, because there are a lot of issues involved with cloud data. First, cloud data is often social, and implies variation through versions (updates or changes to the same dataset) and forks (dataset splits where each set is then modified independently). Second, cloud data is not on the local machine and therefore can feel to some like an "offsite back-up" solution.

With regard to lexical datasets, SIL offers two independent cloud services:

  • languagedepot.org - A web service which enables lexical dataset building teams to share their data with each other through FLEx's built-in Send/Receive function.
  • webonary.org - A website where FLEx data can be hosted and viewed.

http://en.wikipedia.org/wiki/Gnolia
http://www.wired.com/business/2009/01/magnolia-suffer/

First, let's address the cloud social data and the iterative nature of lexical resources. There are lots of tools like DropBox, SugarSync, Google Drive, etc. These are not necessarily even successful back-up strategies.
