Lexical Database Archiving Questionnaire

It's true!

I am asking around on different mailing lists to gain some insight into the archiving habits of linguists who use lexical databases. I am specifically interested in databases created by tools like FLEx, ToolBox, Lexus, TshwaneLex, etc.

Background Story

OLAC Usage of DCMIType Collection

In light of the following usage guideline statements from OLAC and my paper on Collections, I should write another note. http://www.language-archives.org/NOTE/usage.html#Type

The DCMI type Collection should be used in conjunction with one or more other applicable type terms to represent the nature of the resource as an aggregate of specific, but closely related resources, resources which could also stand on their own but were grouped together as a unit by their creator or a subsequent compiler. See Granularity of resources below for more discussion.

http://www.language-archives.org/NOTE/usage.html#Granularity%20of%20resources

Determining the right level for units to be described as language resources in the OLAC context involves multiple factors. The level of unit appropriate for inclusion in an aggregated catalog like OLAC's may be different (typically higher) than the level desirable for the catalog of a specific institution's holdings, which in turn is typically higher than the level desirable for describing the detailed contents of a resource. Section 6 of the [OLAC-Repositories] standard establishes the following basic guideline regarding the granularity of the records in an OLAC repository:

  • A metadata repository should treat resources with a single provenance as constituting a single unit with respect to OLAC metadata and should, therefore, describe them within a single record.
    The following discussion is aimed at assisting an OLAC participant to find the right level of description.

    For a resource that has been published in some form, the appropriate unit of description for the OLAC record is the unit of the publication itself. A collective work (e.g., a festschrift) may warrant separate records for the separate papers contained within it (which should be related to the record for the work as a whole through the isPartOf and hasPart relationships; see Relation). In general, the OLAC record parallels a citable source. Thus, for published works, granularity does not really pose a problem. However, many archival resources have not been formally published. Unpublished papers that present findings of research closely parallel typical publications and can be treated in a comparable way as units for archival description.

    Granularity poses the greatest problem with primary source materials (e.g., recordings, transcriptions, annotations, notes, data sets). The typical practice of archivists is to gather such materials into collections, which in turn become the primary units of archival description (i.e., the result is resources of DCMI type collection; see Type). The collection, for example, of a single field trip may contain a large number of distinct components—the separate pieces of documentation that comprise the records of a number of distinct linguistic events (each with recording, transcription, translation, annotation, etc.). However, these components are not the units for description at the OLAC level. It is the unit of the collection that forms the basic unit for OLAC description.

    The foremost factor in determining what materials belong together as a single collection is their Provenance. In the typical case, the collected resources have a common provenance, that is, they have a common origin and history. Common origin includes who was responsible for collecting the materials, as well as when and where they were collected. It could be a single researcher or a research team, either in the context of a single trip or a series of related trips. It could also be a project that draws together materials from disparate sources for a single new research purpose, thus creating a new collection based on a secondary use of the materials. Common history is also relevant; the fact that the set of resources has been moved or changed hands or processed as a whole since it was originally collected helps to establish its identity as a single unit for archival description.

    In other cases, materials may have been placed together at some point in time well after their creation, possibly by the archive itself. If this has occurred, there should be some coherent organizing principle or intermediate stage of development that is still relevant for the proper understanding and use of the resources. Materials without a common provenance generally should not be made to constitute a collection and should not be described collectively in an OLAC description.

    Situations of shared provenance that constitute a collection for arrangement and description purposes will be distinguished by the high degree of commonality of metadata elements, e.g., same researcher(s), author(s), subject language(s), content language(s), approximate dates, coverage, linguistic type. Other metadata elements that may also be important for resource discovery but that may differ for items in the collection (e.g., format or discourse type) can be repeated at the OLAC description level for as many values as are significantly present in the collection. Alternatively, the collection may be divided into sub-collections by discourse genre, speaker, or other significant feature. The sub-collections can then be treated in distinct OLAC records (and related to the whole through the isPartOf and hasPart relationships; see Relation).

    A collection is generally described in greater detail through the use of a "finding aid" that gives the details of organization and highlights features of interest. Further characteristics of individual items within a collection (such as topic, additional contributors, event specifics, extent, format) are documented at a finer level of granularity by creating such a collection description. This should also be included as a part of the collection, preferably as a structured document or metadata set. The Object Reuse and Exchange standard currently under development by the Open Archives Initiative [OAI-ORE] offers a means of handling such descriptions as a different kind of harvestable metadata in the OAI-PMH context. The session-level detail of [IMDI] description typically aligns with the finer level of description in OAI-ORE rather than the level of Dublin Core description in OAI-PMH.

    OLAC data quality investigator

    On the flight back from Finland I found it challenging to use my laptop, so I pulled out my scratch pad to draw out some ideas I was having. One of them was a record quality investigator: a tool that lets one investigate the presence or absence of features, or sets of features, in a record or set of records. The goal is to look for any patterns in the records that might be interesting and notable.

    What follows are my written notes.

    [Images: scanned notes, pages 1 through 8]
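
    The presence-or-absence idea described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the sample XML layout (a `records` wrapper with `record` elements carrying an `id`) is hypothetical and much simpler than real OLAC records, and the function names `feature_sets`, `presence_report`, and `records_missing` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample: two records using Dublin Core elements.
# Real OLAC records are wrapped differently; this layout is an assumption.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="r1">
    <dc:title>Wordlist</dc:title>
    <dc:language>en</dc:language>
    <dc:type>Dataset</dc:type>
  </record>
  <record id="r2">
    <dc:title>Field notes</dc:title>
    <dc:type>Collection</dc:type>
    <dc:type>Text</dc:type>
  </record>
</records>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def feature_sets(xml_text):
    """Map each record id to the set of element names present in it."""
    root = ET.fromstring(xml_text)
    return {
        rec.get("id"): frozenset(local_name(child.tag) for child in rec)
        for rec in root.iter("record")
    }


def presence_report(sets, features):
    """Count how many records contain each feature of interest."""
    return {f: sum(1 for s in sets.values() if f in s) for f in features}


def records_missing(sets, feature):
    """List the ids of records that lack the given feature."""
    return sorted(rid for rid, s in sets.items() if feature not in s)


if __name__ == "__main__":
    sets = feature_sets(SAMPLE)
    print(presence_report(sets, ["title", "language", "type"]))
    print(records_missing(sets, "language"))
```

    Keeping each record as a set of element names makes it easy to ask about combinations later, for example which records have a type but no language.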

    Resetting OLAC Documentation

    OLAC Documentation is a set of independently evolving documents, each with its own version number. What if these documents evolved together in sync? Would it make the process more manageable? What if the documentation looked different, more like modern documentation?

    I like the following document presentation layouts:

    https://github.com/bep/docuapi
    https://github.com/matcornic/hugo-theme-learn
    https://github.com/alex-shpak/hugo-book
    https://github.com/google/docsy
    https://github.com/h-enk/doks

    The next step is to take the most recent versions of the OLAC documents and convert them to markdown. This converter catches all the XML tags, while others don't: https://codebeautify.org/html-to-markdown

    On the list are:

    http://www.language-archives.org/OLAC/repositories.html
    http://www.language-archives.org/OLAC/metadata.html
    http://www.language-archives.org/REC/bpr.html
    http://www.language-archives.org/REC/olac-extensions.html
    And each of the extensions.
    These need to be cast one way for developers and implementers of technology and another way for managers and archivists.

    OLAC Validator Custom Messages

    OLAC Validator custom messages can be created following these steps:
    https://xerces.apache.org/xerces2-j/faq-xs.html#faq-4
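
    As a rough illustration of what custom messages could look like from the user's side, here is a minimal Python sketch that post-processes validator output. It is not the mechanism described in the Xerces FAQ linked above. It assumes each raw message starts with a schema error key such as `cvc-complex-type.2.4.a` followed by a colon; the `CUSTOM_MESSAGES` table, its wording, and the `customize` function are hypothetical.

```python
import re

# Hypothetical mapping from schema error keys to friendlier wording.
CUSTOM_MESSAGES = {
    "cvc-complex-type.2.4.a": "An element appears where the schema does not allow it.",
    "cvc-datatype-valid.1.2.1": "A value does not match the data type the schema requires.",
}

# Assumed message shape: "<error key>: <detail text>".
KEY_PATTERN = re.compile(r"^(cvc-[\w.\-]+):\s*(.*)$", re.DOTALL)


def customize(raw):
    """Return a friendlier message when the error key is known, else the raw text."""
    match = KEY_PATTERN.match(raw.strip())
    if match is None:
        return raw
    key, detail = match.groups()
    friendly = CUSTOM_MESSAGES.get(key)
    if friendly is None:
        return raw
    # Keep the original detail so no information is lost.
    return f"{friendly} (validator detail: {detail})"


if __name__ == "__main__":
    print(customize("cvc-complex-type.2.4.a: Invalid content was found starting with element 'foo'."))
```

    Unknown keys and messages that do not match the assumed shape pass through unchanged, so the table can grow gradually.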

    The validator is built on this software, which ideally would also be containerized with the other parts: https://xerces.apache.org/xerces-p/samples/validator.html

    One approach to containerizing it might be to use this script (which is older and Linux-oriented): https://github.com/dgricci/xmllint

    Another option is to use: https://hub.docker.com/r/isaitb/xml-validator

    If this service were implemented on a new server with a web interface, we might expect to use a newer HTML front end.

    Here is what I found via GitHub:

    https://github.com/ebruchez/darius-xml.js
    https://github.com/fulvio999/jxmlutil

    Darius looks more promising, but neither is an "out of the box" tool.

    Fixing Brio

    My parents got my brother and me Brio trains around 1984. Over the years some of those pieces broke and were repaired. Some of them got additional shades of pigment added. They survived through seven kids and now my two kids. Well, almost survived.

    Undoing previous repairs

    Today I repaired three pieces by adding pegs. I made the pegs by drilling a 3/8” hole in a wooden ball I bought from a craft store. Then I glued in a dowel from the hardware store. Next I sanded the tip flat. Then I sanded the circumference of the ball to give clearance on the sides. Then I pre-drilled the track portion and dry-fit the pegs measuring the amount I needed both in the track and for appropriate connection to the next track piece. I then cut the dowel to proper size. Finally I glued the peg in with wood glue.

    Dry-fit for usability.
    Three fixed pieces