Where in MARC records language codes are found and used

Posted on December 7, 2024 by Hugh Paterson III

Places to use language codes in MARC records: MARC 041, as well as MARC 546 and, of course, the fixed-field language code in 008/35-37 (rather than the leader itself).
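As a rough illustration of where these codes sit, the sketch below reads the three-character code from positions 35-37 of an 008 field and collects the $a subfields of an 041 field. The sample values are invented, the 008 string is blanked out except for the language slot, and the function names are hypothetical helpers rather than part of any MARC library.

```python
def language_from_008(field_008: str) -> str:
    """Return the three-character language code at positions 35-37 of an 008 field."""
    if len(field_008) < 38:
        raise ValueError("008 field is too short to hold a language code")
    return field_008[35:38]


def languages_from_041(subfields):
    """Collect the codes carried in the $a (language of text) subfields of an 041 field."""
    return [value for code, value in subfields if code == "a"]


# Assumed sample data: positions 00-34 are left blank for brevity.
sample_008 = " " * 35 + "fin" + "  "
# Subfields as (code, value) pairs; $h is the language of the original.
sample_041 = [("a", "fin"), ("a", "swe"), ("h", "eng")]

print(language_from_008(sample_008))
print(languages_from_041(sample_041))
```

MARC 546 is a free-text note, so it is left out here: there is no code to extract from it without further parsing.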

VRA Core and its use of xml:lang

Posted on December 15, 2023 by Hugh Paterson III

Some information professionals might be confused about the use of language identification metadata in larger bibliographic metadata standards. For example, VRA Core (Visual Resources Association) is a metadata standard used to describe visual artifacts. It is implemented in XML and therefore takes on all the descriptive power of XML, including the use of the xml:lang attribute.

The following observations are made using the VRA (Visual Resources Association) Core 4 XML Schema, version 0.42. This schema implements the final VRA Core 4.0 guidelines, 2007-04-09. It is important to note that metadata standards implemented by memory institutions really have two parts: the "guidelines" and the "implementation" of those guidelines (in this case an XSD validation file). These two documents may not always be congruent, even if that is the intention. In such cases I argue that the technical implementation takes precedence over the guidelines, as that seems to be the most defensible place to locate definitive authority.

The XSD validation document contains the following annotation around the use of the xml:lang attribute.

VRA Core metadata attributes which can be applied to virtually any element. Note that xml:lang should contain ISO 639 language codes, not the English names of languages. Although the XML Schema defines xml:lang as allowing ISO 639-2 (three-letter) codes, some validators will only accept ISO 639-1 (two-letter) codes.

This annotation is misleading. First, the VRA Core authors are trying to alert catalogers and technologists that they should not use the full-text name of a language, as might be done in other "library oriented standards", but should use language codes instead. In general this is a good thing. However, the VRA authors misread the XML specification. Specifically, they indicate the need to use ISO 639 language codes. This is not true. XML requires BCP-47 language tags. This can be found in the specification for XML 1.0 fifth edition §2.12 https://www.w3.org/TR/xml/#sec-lang-tag. It is true that BCP-47 currently draws on ISO 639 codes, but this might not always be the case.

A second issue with the annotation is how it distinguishes between the use of ISO 639-2 and ISO 639-1. If there are VRA Core data consumers or producers who are not consuming or producing valid XML, then this is a transmission machinery issue, not a protocol issue. BCP-47 does not call for the use of ISO 639-2/3 tags when there is an equivalent ISO 639-1 tag. If data ingest processes have only implemented ingest of ISO 639-1, then they haven't implemented VRA, because VRA stands on XML, which stands on BCP-47. BCP-47 is an algorithm which calls upon different standards at different times. Understanding the fallback nature of the algorithm would have clarified this point for the VRA authors.
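To make the point concrete, here is a minimal sketch that reads xml:lang values with Python's standard library and flags primary subtags for which a shorter ISO 639-1 form exists. The XML fragment is invented and its element names are not checked against the VRA Core schema; the lookup table is a tiny illustrative subset, not the IANA Language Subtag Registry.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment for illustration only.
SAMPLE = """<work>
  <title xml:lang="en">Water Lilies</title>
  <title xml:lang="fra">Nymphéas</title>
</work>"""

# Assumed, partial mapping of three-letter codes to their two-letter equivalents.
SHORTER_FORM = {"eng": "en", "fra": "fr", "fre": "fr", "deu": "de", "ger": "de"}


def primary_subtag_report(xml_text):
    """Return (tag, suggested shorter primary subtag or None) for each xml:lang found."""
    root = ET.fromstring(xml_text)
    report = []
    for element in root.iter():
        tag = element.get(XML_LANG)
        if tag is None:
            continue
        primary = tag.split("-")[0].lower()
        report.append((tag, SHORTER_FORM.get(primary)))
    return report


for tag, suggestion in primary_subtag_report(SAMPLE):
    print(tag, "->", suggestion or "ok")
```

A real check would consult the full subtag registry; the sketch only shows that the choice between two-letter and three-letter codes is decided by BCP-47, not by the preferences of a given validator.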

The following resources are useful for a better understanding of Language Tags in XML:

Of-ness in audio recordings

Posted on May 17, 2023 by Hugh Paterson III

Subject analysis is very interesting. In a recent investigation into a theory of subject analysis, I was introduced to the concepts of: "about-ness", "is-ness", "of-ness".

Sometimes I wonder if linguists defy standard practices in subject representation, or if they define what a general population holds as a challenge with subject analysis in cataloging.

I harken to the OLAC application profile, which is based on Dublin Core. Dublin Core does not scope the subject element to "about-ness" analysis. The UNT curriculum is informed by, and based (in structure) on, Steven J. Miller's Metadata for Digital Collections: A How-To-Do-It Manual. The issue at hand is that for linguists, about-ness is only relevant for information resources representing analysis. For other kinds of resources, such as primary oral texts or narratives captured via video, which are often the object analyzed and discussed in information resources representing analysis, the primary view on subjecthood is through of-ness. As far as I know, no one has discussed audio and of-ness descriptions of audio.

It also makes me wonder if genre is mostly about utility and not about a binding style. To this end, a scholar looking for a phonology corpus is looking for what? A combination of things: a MIMEType, with a relationship to another MIMEType, with an of-ness of a kind and a subject of "phonology".

Splitting up the concepts of "about-ness", "is-ness", and "of-ness" provides analytical space for more articulate descriptions in the dc:description field. But when it comes to language materials, the question is: is language a subject by virtue of "of-ness" or by virtue of "about-ness"? There are several implications here:

The description field ought to be re-thought.
The subject field ought to be re-thought.
Some searches by linguists are likely the concatenation of two or three factors: A relationship between two records, and a subject of a kind and a subject of a different kind.

Quantitative Analysis of Metadata Errors

Posted on March 13, 2023 by Hugh Paterson III

Various approaches to metadata quality assessment divide the assessment criteria into sections, for example accuracy, consistency, and completeness. However, one should ask if a quantitative approach to metadata quality assessment is better than a qualitative approach. Some may point out that the two are not mutually exclusive, and therefore not in direct competition with each other. However, I wonder if this is true. For example, if one has limited reading time, does one benefit more from reading the percentage of errors relative to another error type, or does one learn more by reading about the assumed noncompliance or disharmony across metadata records?

The second point in suggesting that a qualitative description of metadata quality might be better than a quantitative description is related to root causes, presumably the purpose of the investigation in the first place.

It seems to me that a quantitative approach makes the data the discussion and ignores the methods by which the data got into the observed format. For example, what were the human factors under which the metadata was produced? What was the workflow? What was the target metadata scheme at the time the records were created? What checking process did management implement, i.e. what were they checking for, and what were their metrics for success?

A qualitative analysis can show where the current process meets the management considerations. Essentially this is problem-solution fit analysis, where metadata quality is a trailing performance indicator for business processes. However, it gets interesting here because the prevailing thought is that metadata is also the way that customers are served through the organization. That is, it is like a loss-leader product in that it is a product to get a customer to the main product.

Purely quantitative analysis simply announces that issues exist within a relative order. It doesn’t seek to explain the shortcomings using a contextual analysis.

Dynamic collections aren’t.

Posted on March 12, 2023 by Hugh Paterson III

Some years ago, scholars were debating the definition of collection. In an archival sense, and the more traditional sense, a collection refers to a direct or accumulating set of resources. In a library sense a collection may wax and wane depending on the curation of the collection. So what is a digital collection, especially in an aggregator of metadata?

To this question I have given some thought. The DCMIType “collection” is ambiguous on this point. Aggregations seem not to be the same as “collection” in that they are continuously updating, and may be different for different viewers! However, essentially this is the same definition that is used in libraries.

After about a year and a half of thinking about this point, I think I have a solution. Aggregations, such as those through OAI or RSS, are not collections at all. Rather, aggregations are a view through a dynamic access point. RDA and IFLA LRM are two models that use the concept of access points. Aggregations, in this sense of access point, are temporary applications of an access point to a resource. In RDA and IFLA LRM these access points are hard coded on the record. This need not be the case all the time in an information retrieval system. An information retrieval system can have its own coded access points, independent of the data it is operating on. In this way the information retrieval system might mitigate the possible limits in the information structure of the information being retrieved. It validates the autonomy of the information retrieval system from the information.
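One way to picture an aggregation as a view rather than a stored set is the small sketch below. The record fields and the function name make_access_point are hypothetical; this is not how OAI, RSS, RDA, or IFLA LRM define anything, only an illustration of an access point that lives in the retrieval system instead of on the record.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def make_access_point(predicate: Callable[[Record], bool]) -> Callable[[Iterable[Record]], List[Record]]:
    """Build a reusable view: nothing is stored, the predicate is applied at query time."""
    def view(store: Iterable[Record]) -> List[Record]:
        return [record for record in store if predicate(record)]
    return view


# Assumed sample records; the access point is not coded on any of them.
store = [
    {"id": "a1", "language": "fi", "type": "Sound"},
    {"id": "b2", "language": "en", "type": "Text"},
]

finnish = make_access_point(lambda record: record["language"] == "fi")

before = [record["id"] for record in finnish(store)]
store.append({"id": "c3", "language": "fi", "type": "Text"})
after = [record["id"] for record in finnish(store)]

print(before, after)
```

The same access point yields different members at different times, which is the sense in which the aggregation is a view and not a collection.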

This sort of solution preserves the definition of collection, bringing sanity to the concept.

OLAC data quality investigator

Posted on January 26, 2023 by Hugh Paterson III

On the flight back from Finland I found it challenging to use my laptop and pulled out my scratch pad to draw out some ideas I was having. One of those ideas was for a record quality investigator: a tool which lets one investigate the presence or absence of features or sets of features in a record or set of records. The goal is to look for any patterns in the records which might be interesting and notable.
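A minimal sketch of that idea, assuming records are simple dictionaries and that a "feature" is just a populated field, could look like the following. The field names and sample records are invented, and the functions are hypothetical, not part of any OLAC tooling.

```python
from collections import Counter


def feature_profile(record, features):
    """Return a tuple of booleans: True where the record has a non-empty value for the feature."""
    return tuple(bool(record.get(feature)) for feature in features)


def profile_counts(records, features):
    """Count how many records share each presence/absence pattern."""
    return Counter(feature_profile(record, features) for record in records)


# Assumed sample records for illustration.
records = [
    {"title": "A", "language": "fin", "rights": ""},
    {"title": "B", "language": "fin"},
    {"title": "C", "language": "eng", "rights": "CC BY"},
]
features = ["title", "language", "rights"]

counts = profile_counts(records, features)
for pattern, count in counts.most_common():
    print(dict(zip(features, pattern)), count)
```

Grouping by pattern rather than by single field is what makes sets of features visible, for example that every record lacking rights also shares the same language.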

What follows are my written notes.

Mode vs. Medium

Posted on January 5, 2023 by Hugh Paterson III

Two terms which seem to be very confusable to me are Medium and Mode.

Medium relates to format and the carrier, whereas mode is more like the classification of mediums by how they are experienced. Mode relates to the mode of communication, for example visual, linguistic, spatial, aural, or gestural.

Mode is also not to be confused with mode of issuance, which relates to whether the resource is released as a single unit or as a multipart unit, often over time.

Text object metadata

Posted on January 4, 2023 by Hugh Paterson III

I find that this text object metadata scheme might be useful for describing corpora.

https://www.loc.gov/standards/textMD/

I should look at these auxiliary METS extensions and include them in OLAC discussions.

Rights Metadata and Rights Vocabularies

Posted on January 4, 2023 by Hugh Paterson III

In the fall term of 2022 I took a course on metadata at UNT. In that course I encountered an interesting rights metadata schema created by the California Digital Library called copyrightMD. This schema is interesting because it articulates where a resource was created.

This is currently on the web here:
https://cdlib.org/groups/rights-management-group-copyrightmd/
But that website seems to not render 100% so I looked it up in the Internet Archive here:
http://web.archive.org/web/20220119153216mp_/https://cdlib.org/wp-content/uploads/2019/01/copyrightMD_user_guidelines.pdf

CopyrightMD has been mentioned in the following academic publications:

I find the list of rights metadata schemas in the library guide at UCF very helpful:

https://guides.ucf.edu/metadata/adminMetadata

For rights metadata, the common metadata standards such as Dublin Core include a “rights” field. Any known intellectual property rights held for the data, including access rights and rights holder, can be specified in that field. Some digital repositories provide an opportunity to assign a Creative Commons license to the materials or datasets deposited in the repository.

There are other Rights Metadata standards including CopyrightMD, METSRights, ONIX for Publications Licenses, Open Digital Rights Language and XrML.

https://wiki.creativecommons.org/wiki/CC_REL

However, I found that the METSRights schema was not linked appropriately:
https://www.loc.gov/standards/rights/
https://www.loc.gov/standards/rights/METSRights.xsd
https://www.loc.gov/standards/rights/2005version/METSRights.xsd

I personally find the statements at rightsstatements.org to be limiting:
https://rightsstatements.org/en/

The educational use permitted one is very confusing: https://rightsstatements.org/page/InC-EDU/1.0/?language=en

Note that Creative Commons used to have a license like this, but they did away with the whole educational use series of licenses. I can't find them at the moment; I would have thought they might have been here: https://creativecommons.org/retiredlicenses/

http://web.archive.org/web/20100101121150/https://learn.creativecommons.org/
http://web.archive.org/web/20080714211609/http://creativecommons.org/weblog/entry/8235
https://wiki.creativecommons.org/wiki/I_want_to_make_sure_that_the_OER_I_create_are_used_only_for_truly_educational_purposes._That_means_I_should_limit_my_works_to_%E2%80%9Ceducational_use_only,%E2%80%9D_right%3F

Creative Commons Welcomes David Wiley as Educational Use License Project Lead

Real problems with academics using CC licenses:
https://smcclatchy.github.io/exp-design/LICENSE.html
Copyfraud: https://www.researchgate.net/publication/228219706_Copyfraud (August 2005, New York University Law Review (1950) 81(3))
see also: 10.5334/jcms.1021217
https://www.researchgate.net/publication/275440056_The_Public_Domain_vs_the_Museum_The_Limits_of_Copyright_and_Reproductions_of_Two-dimensional_Works_of_Art
see also: 10.1002/meet.14504701045
see also: 10.2139/ssrn.1806809
see also: https://www.researchgate.net/publication/308339459_Museums_Property_Rights_and_Photographs_of_Works_of_Art_Why_Reproduction_Through_Photograph_Should_Be_Free

see also: 10.1515/9783110732009-010 — 8 Rights Issues in the Digitization of Library Collections

~~~~
OER Notes:

U.S. Department of Education Open Licensing Rule Now in Effect

https://wiki.creativecommons.org/wiki/Creative_Commons_and_Open_Educational_Resources
https://wiki.creativecommons.org/wiki/OER_Project

OLAC query needs

Posted on December 30, 2022 by Hugh Paterson III

The following example points to the need for users to be able to sort collections by license, relationships, and extent.

I am looking for large spoken corpora of spontaneous speech in any
language (ideally > 100 hours) with a time-aligned transcription. I am
not committed to a specific genre as long as it is spontaneous speech.
It should be available as a download (for research, no commercial use),
ideally free but I may be able to pay for it as well.
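The request above could be approximated as a filter and sort over records, assuming the metadata actually exposed extent in hours, a license, and relationships to transcription resources. All field names and values below are hypothetical; OLAC records do not currently carry data in this shape, which is the point of the post.

```python
def candidate_corpora(records, min_hours, allowed_licenses, required_relation):
    """Filter on extent, license, and relationship, then sort by extent, largest first."""
    hits = [
        record for record in records
        if record["extent_hours"] >= min_hours
        and record["license"] in allowed_licenses
        and required_relation in record["relations"]
    ]
    return sorted(hits, key=lambda record: record["extent_hours"], reverse=True)


# Assumed sample records for illustration.
corpora = [
    {"id": "c1", "extent_hours": 250, "license": "CC BY-NC 4.0", "relations": ["hasTimeAlignedTranscription"]},
    {"id": "c2", "extent_hours": 40, "license": "CC BY 4.0", "relations": ["hasTimeAlignedTranscription"]},
    {"id": "c3", "extent_hours": 120, "license": "All rights reserved", "relations": []},
    {"id": "c4", "extent_hours": 500, "license": "CC BY 4.0", "relations": ["hasTimeAlignedTranscription"]},
]

results = candidate_corpora(
    corpora,
    min_hours=100,
    allowed_licenses={"CC BY 4.0", "CC BY-NC 4.0"},
    required_relation="hasTimeAlignedTranscription",
)
print([record["id"] for record in results])
```

Genre ("spontaneous speech") is deliberately left out of the sketch, since it would need a controlled value that the request itself does not pin down.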

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Tag Archives: metadata