This post is an open draft! It might be updated at any time.
Metadata is very important, and on that everyone agrees. However, there is some discussion about how to develop metadata and how to ensure that it is accurate. Taxonomies are limited vocabularies (a fixed set of terms) where each term has a predefined definition. A folksonomy is a vocabulary where people, usually users of the data, assign their own useful words or metadata to an item. Folksonomies are like taxonomies in that both are sets of terms, but they differ in that folksonomies are open sets while taxonomies are closed sets.
An example of a taxonomy might be the colors of a traffic light: Red, Yellow, and Green. If this were a folksonomy, people might also suggest Amber, Orange, Blue-Green, and Blue. These additional terms may be accurate to some viewers of traffic lights, or in some cases, but they do not fit the stereotypical model of what the colors of traffic lights are.
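The closed/open distinction can be sketched in code. A minimal illustration (the names are mine, not from any standard): a taxonomy rejects terms outside its fixed vocabulary, while a folksonomy accepts any term and simply grows.

```python
# A taxonomy is a closed set: terms outside the vocabulary are rejected.
TRAFFIC_LIGHT_TAXONOMY = {"Red", "Yellow", "Green"}

def tag_with_taxonomy(term):
    if term not in TRAFFIC_LIGHT_TAXONOMY:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term

# A folksonomy is an open set: any user-supplied term is accepted and stored.
folksonomy = set(TRAFFIC_LIGHT_TAXONOMY)

def tag_with_folksonomy(term):
    folksonomy.add(term)  # the vocabulary grows with use
    return term

tag_with_folksonomy("Amber")  # accepted: the set simply grows
```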
If it does, then is it going to use WordPress to advertise things, or is it going to use WordPress to aggregate things? If the former, nothing out there has ever let the admin user choose which fields were matched to which attributes, dynamically. But then why would anyone actually want this functionality? What is the use case? If one is using WordPress as a bibliographic reference system, as some libraries do, then this makes a lot of sense. However, there is another use case I would like to present: the website which is about a single language or several languages. There are potentially two ways to conceptualize this:
January 4-5, 2012, I had the opportunity to participate in the LSA's Satellite Workshop for Sociolinguistic Archival Preparation in Portland, Oregon. There were a great many things I learned there. So here are only a few thoughts.
Part of the discussion at the workshop was on how we can make corpora which are collected by sociolinguists available to the larger sociolinguistic community. In particular, the discussion I am referencing revolved around the standardisation of metadata in the corpora. (In the discussion it was established that there are two levels of metadata, "event level" and "corpus level".) While OLAC gives us some standardization of the corpus level metadata, the event-level metadata is still unique to each investigation, and arguably this is necessary. However, it was also pointed out that not all "event level" metadata need be encoded or tracked uniquely. That is, data like the date of recording, names of participants, location of recording, and gender (male/female) of each participant can all be regularized across the community.
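As a sketch of what such a regularized event-level record might contain, here is a minimal Python structure (the field names are my own illustration, not a community standard):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RecordingEvent:
    """Event-level metadata that could be regularized across the community."""
    recording_date: date
    location: str
    participants: list                           # participant names
    genders: dict = field(default_factory=dict)  # name -> "male" / "female"

event = RecordingEvent(
    recording_date=date(2012, 1, 4),
    location="Portland, Oregon",
    participants=["Speaker A"],
    genders={"Speaker A": "female"},
)
```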
With the above as preface, we do need to understand that there are still various kinds of metadata which need to be collected. In the workshop it was acknowledged that the field of language documentation is about 10 years ahead of this community of sociolinguists. What was not well defined in the workshop was the distinction between a language documentation corpus and a sociolinguistics corpus. It seems to me, as a new practitioner, that the chief difference between these two types of corpora is the self-identification of the researcher. That is, does the researcher self-identify as a sociolinguist or as a language documenter? Both types of corpora attempt to get at the vernacular, and both collect sociolinguistic facts. It would seem that both corpora are essentially the same (give or take a few metadata attributes). So, I will take an example from the metadata write-up I did for the Meꞌphaa language documentation project. In that project we collected metadata about:
Equipment settings during recording
In the following diagram I illustrate the cross cutting of a corpus with these "kinds" of metadata. The heavier, darker line represents the corpus, while the medium heavy lines represent the "kinds" of metadata. Finally, the lighter lines represent the sub-kinds of metadata, where the sub-kinds might be the latitude, longitude, altitude, datum, country, and place name of the location.
Corpora metadata categories with some sub-categories
This does not mean that the corpus does not also need to be cross cut with these other "sub-kinds". However, these sub-kinds are significantly more numerous and will vary from project to project. Some of these metadata kinds will be collected in a speaker profile questionnaire, but some can only be provided through reflection on the event. To demonstrate the cross cutting of these metadata elements on a corpus I have provided the following diagram. It uses categories which were mentioned in the workshop and is not intended to be comprehensive. In this second diagram, the cross-cutting elements might themselves be taxonomies: they may have controlled vocabularies, they may have an open set of possible values, or they may represent a scale.
Taxonomies for social demographics and social dynamics for speakers in corpora
Both of these diagrams tend to illustrate what in this workshop was referred to as "event level" metadata, rather than "corpus level" metadata.
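The kind/sub-kind layering in the diagrams could be represented as nested records. A minimal sketch using the location sub-kinds named above (keys and values are illustrative only):

```python
# Each "kind" of event-level metadata carries its own "sub-kinds"; location
# is the example above: latitude, longitude, altitude, datum, country, place name.
event_metadata = {
    "location": {
        "latitude": 17.25,   # illustrative values
        "longitude": -98.6,
        "altitude_m": 1200,
        "datum": "WGS84",
        "country": "Mexico",
        "place_name": "example village",
    },
    "recording": {
        "date": "2012-01-04",
        "equipment_settings": {"sample_rate_hz": 48000},
    },
}

location_subkinds = sorted(event_metadata["location"])
```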
A note on corpus level metadata vs. descriptive metadata
There is one more thing I would like to say about "corpus level" metadata. Metadata is often separated out by function. That is, what does the metadata allow us to do, or why is the metadata there?
I have been exposed to the following taxonomy of metadata types through course work and in working with photographs and images. These classes of metadata are also similar to those posted by JISC Digital Media as they approach issues with metadata for digital audio.
Descriptive metadata: supports discovery, attribution, and identification of resources created.
Administrative metadata: supports management, preservation, and appropriate usage of resources created.
Technical metadata: describes the machinery used to create the resource and the technical aspects of the resource.
Use and rights metadata: covers the copyright, license, and moral ownership of the items.
Structural metadata: maintains relationships between the parts of complex, multi-part resources (Spanne 2008).
Situational metadata: describes the events around the creation of the work, asking questions about the social setting or the precursory events; it follows ideas put forward by Bergqvist (2007).
Use metadata: metadata collected from or about the users themselves (e.g. user annotations, or the number of people accessing a particular resource).
I think it is only fair to point out to archivists and librarians that linguists and language documenters do not see a difference between descriptive and non-descriptive metadata in their workflows. That is, sometimes we want to search all the corpora by license or by a technical attribute. This elevates these attributes to the function of discovery metadata. It does not remove descriptive metadata from its role in finding things, but it does functionally mean that the other metadata is also viable as discovery metadata.
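In code, elevating an attribute to discovery metadata just means letting the search predicate range over every field rather than only the "descriptive" ones. A hypothetical sketch (the records and field names are invented for illustration):

```python
# Each record mixes descriptive, rights, and technical metadata in one place.
catalog = [
    {"title": "Wordlist recording", "license": "CC-BY", "sample_rate_hz": 48000},
    {"title": "Narrative text", "license": "CC-BY-NC", "sample_rate_hz": 44100},
]

def discover(records, **criteria):
    """Treat every metadata field as discovery metadata: filter on any of them."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

by_license = discover(catalog, license="CC-BY")    # search by a rights attribute
by_rate = discover(catalog, sample_rate_hz=44100)  # search by a technical attribute
```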
I have been following the Learning Resource Metadata Initiative (LRMI), a collaborative effort between Creative Commons and the Association of Educational Publishers (Creative Commons 2011), with some interest as I start to look at SIL.org and how potential services and resources offered through SIL.org might be merged with the larger world of well-described data.
SIL has a long tradition of providing linguistic training. With the digital revolution, it only seems right that these training resources would be described appropriately in the educational arena. It will be interesting to look at LRMI as it develops over the next few months. And then to think about applying it in the context of Drupal.
Creative Commons. 7 June 2011. Creative Commons & the Association of Educational Publishers to establish a common learning resources framework. http://creativecommons.org/weblog/entry/27603 . [Accessed: 27 November 2011] [Link]
I have recently been reading the blog of Martin Fenner and came upon the article Personal names around the world. His post is in fact a reflection on a W3C paper of the same title (several other reflections are collected at http://www.w3.org/International/wiki/Personal_names). The paper is apparently coming out of the i18n effort and is meant to help authors and database designers make informed decisions about names on the web.
I read Martin’s post with some interest because in language documentation, getting someone’s name as a source or for informed consent is very important (from a U.S. context). Working in an archive dealing with language materials, I see a lot of names. One of the interesting situations which came to me from an Ecuadorian context was different from anything I have seen in the w3.org paper or in the w3.org discussion. The naming convention went like this:
The elder was known by the younger’s name plus a relationship.
My suspicion is that it is taboo to name the dead. So, to avoid possibly naming the dead, the younger person was referenced and the relationship was invoked. This affected me in the archive because I am supposed to note who the speaker is on the recordings. In lieu of the speaker’s name, I have the young son’s first name (he is well known in the community and is in his 30’s or so) and I have the relationship. So in English this might sound like "John’s mother". Now what am I supposed to put in the metadata record for the audio recordings I am cataloging? I do not have a name, but I do have a relationship to a person known to the community.
I inquired with a literacy consultant who has worked with indigenous people in Ecuador for some years. She informed me that in one context she was working in, everyone knew what family line they were from, and all the names were derived from that family line by position. It was such that to call someone by their name was an insult.
It sort of reminds me of this sketch by Fry and Laurie.
Working in an archive, one can imagine that letting go of materials is a real challenge: it is hard to do because of policy, but also because of the emotional "pack-rat" nature of archivists. This is no less the case at the archive where I work. We were recently working through a set of items and getting rid of the duplicates. (Physical space has its price; and the work should soon be available via JASOR.) However, one of the items we were getting rid of was a journal issue on a people group/language. The journal has three articles; of these, only one was written by someone who worked for the same organization I am working for now. So the "employer" and owner-operator of the archive only has rights to one of the three works (rights by virtue of "work-for-hire" laws). We have the off-print, which is what we have rights to share, so we keep and share that. It all makes sense. However, what we keep is catalogued and inventoried. Our catalogue is shared with the world via OLAC. With this tool someone can search for a resource on a language, by language. It occurs to me that the other two articles on this people group/language will not show in the aggregated OLAC results. This is a shame, as they would be really helpful in many ways. I wish there were a grassroots, open-source, web-facilitated effort where various researchers could contribute metadata (citations) for articles so that they would be added to the OLAC search.
The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository, the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachment of the digital item to the proper bitstream. Totally awesome.
However, one of the challenges is that just because the metadata is curated, collected, and properly filed does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time); later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004 (James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of metadata standards).
So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata in the files before upload so that when the files arrive in DSpace that they will have the correct metadata.
The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so, as they are both brainchildren of the geeks at Adobe.) However, what is needed in this particular setup is a two-way street with a check in between: we would need to overwrite what was there with the data we want there.
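Whatever tool ends up doing the embedding, that "check in between" amounts to diffing the curated catalogue record against what is already embedded in the file, then overwriting the fields that disagree. A sketch of the logic (the field names are illustrative, not DSpace's actual elements):

```python
def reconcile(curated, embedded):
    """Compare curated catalogue metadata against metadata embedded in a file.

    Returns (corrections, merged): the fields that must be overwritten,
    and the metadata the file should carry after the fix.
    """
    corrections = {k: v for k, v in curated.items() if embedded.get(k) != v}
    merged = {**embedded, **curated}  # curated values win every conflict
    return corrections, merged

curated = {"author": "J. Smith", "title": "Wordlist recordings"}
embedded = {"author": "Microsoft-Word", "creator_tool": "Word 14.0"}
corrections, merged = reconcile(curated, embedded)
```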
Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, it does not mean that the submitters would know how or why to use it. This is not, however, a valid reason to leave the functionality out of a development project. All marketing aside, an archive does have a responsibility to consumers of its digital content that the content will be functional. Part of today’s "functional" is the interoperability of metadata. Consumers appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, PhotoShop, iTunes, Mendeley, Papers, etc.
Another thought comes to mind when one is dealing with large files (over 1 GB): there is a reason for making a "preview" version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see whether the file is worth downloading as a .wav? It seems that a metadata packager could also create such a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
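As a sketch of how such previews might be scripted: the snippet below only builds the command lines (ffmpeg for audio, ImageMagick's convert for images) without running them; the file names and bitrate are hypothetical.

```python
from pathlib import Path

def preview_command(source):
    """Build a command line that derives a small preview from a large archival file."""
    source = Path(source)
    if source.suffix == ".wav":
        # ffmpeg: transcode a large .wav to a low-bitrate .mp3 for quick assessment
        return ["ffmpeg", "-i", str(source), "-b:a", "64k",
                str(source.with_suffix(".mp3"))]
    # ImageMagick: shrink an image to a bounded preview JPEG
    return ["convert", str(source), "-resize", "800x800>",
            str(source.with_name(source.stem + "_preview.jpg"))]

audio_cmd = preview_command("session01.wav")
image_cmd = preview_command("plate04.tif")
```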
This problem has been addressed in the open-source library science world. In fact, a nice piece of software does exist: the Metadata Extraction Tool. It is not an end-all for this archive’s needs, but it is a solution for some needs of this type.