
The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Tag Archives: metadata

Scraping archives for OLAC

Posted on November 2, 2022 by Hugh Paterson III

This post collects resources I am compiling for scraping a language archive in order to build a static OLAC feed.

https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html

The archive: http://roa.rutgers.edu/article/browse

https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/

https://www.edureka.co/blog/web-scraping-with-python/

https://www.webscrapingapi.com/python-web-scraping
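
A minimal sketch of the scrape-to-OLAC idea, using only the Python standard library rather than bs4. The listing markup in SAMPLE_HTML, the class names "article" and "author", and the /article/view/ paths are all hypothetical; the real archive's HTML would need to be inspected first. The output is a single olac:olac record, not a complete static repository file.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical listing markup; not copied from the real archive.
SAMPLE_HTML = """
<ul>
  <li class="article"><a href="/article/view/1">First Paper</a>
      <span class="author">A. Author</span></li>
  <li class="article"><a href="/article/view/2">Second Paper</a>
      <span class="author">B. Writer</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect title, creator, and link for each <li class="article">."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_item = False
        self._field = None  # which record field the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "article":
            self._in_item = True
            self.records.append({"title": "", "creator": "", "identifier": ""})
        elif self._in_item and tag == "a":
            self.records[-1]["identifier"] = attrs.get("href", "")
            self._field = "title"
        elif self._in_item and tag == "span" and attrs.get("class") == "author":
            self._field = "creator"

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False
        if tag in ("a", "span", "li"):
            self._field = None

    def handle_data(self, data):
        if self._in_item and self._field:
            self.records[-1][self._field] += data.strip()


def to_olac(record, base_url):
    """Serialize one scraped record as an olac:olac element (as a string)."""
    ET.register_namespace("olac", OLAC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OLAC_NS}}}olac")
    for name in ("title", "creator"):
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = record[name]
    identifier = ET.SubElement(root, f"{{{DC_NS}}}identifier")
    identifier.text = base_url + record["identifier"]
    return ET.tostring(root, encoding="unicode")


parser = ListingParser()
parser.feed(SAMPLE_HTML)
for record in parser.records:
    print(to_olac(record, "http://roa.rutgers.edu"))
```

In practice the HTML would come from a fetched page rather than a string, and each record would need the remaining elements that the OLAC profile expects.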

Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/

NER

NLP Application: Named Entity Recognition (NER) in Python with Spacy

File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
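
As a rough illustration of what file type detection without extensions involves, here is a standard-library sketch that checks the leading "magic" bytes of a file. The signature table is deliberately tiny and the function names are my own; python-magic (linked above) wraps libmagic and recognizes far more formats.

```python
# A few well-known leading byte signatures; real tools know hundreds more.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_type(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    return "application/octet-stream"  # unknown: fall back to generic binary


def sniff_file(path) -> str:
    """Read only the first few bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))


print(sniff_type(b"%PDF-1.7\n"))
```

Note that ZIP-based formats (for example .docx or .epub) all share the ZIP signature, so a scrape that needs to tell them apart would have to look inside the container.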

XML parsing

https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
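
A small sketch of namespace-aware XML parsing with the standard library, applied to an OLAC-style record. The SAMPLE record and the read_record helper are made up for illustration. For XML fetched from untrusted sources, defusedxml (linked above) provides an ElementTree module intended as a safer replacement for the parsing calls.

```python
import xml.etree.ElementTree as ET

NS = {
    "olac": "http://www.language-archives.org/OLAC/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record for illustration only.
SAMPLE = """<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:subject>metadata</dc:subject>
  <dc:subject>archives</dc:subject>
</olac:olac>"""


def read_record(xml_text):
    """Pull the title and all subject values out of one record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("dc:title", default="", namespaces=NS),
        "subjects": [el.text for el in root.findall("dc:subject", NS)],
    }


print(read_record(SAMPLE))
```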

Posted in Other Journals | Tagged bs4, metadata, OLAC, Python | Leave a reply

MARC profiles for Language Archives

Posted on October 31, 2022 by Hugh Paterson III

AACR2 and RDA both constitute application profiles using the same database structure known as MARC. MARC defines the fields and the expected values within those fields (type control) while AACR2 and RDA compose definitions of cognitive models and data fingerprinting (not the LIS terms for these concepts). By cognitive model I mean the mental representation of entities and their relationships and by fingerprinting of data I mean that some artifacts are "well described" when various fields are employed. E.g., a book description needs a publisher, while a manuscript does not.
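
To make the fingerprinting idea concrete, here is a toy sketch in which each resource type lists the fields a "well described" record must have. The field names and the requirements table are hypothetical and are not taken from AACR2, RDA, or the MARC documentation; only the book-versus-manuscript publisher contrast comes from the paragraph above.

```python
# Hypothetical requirements table; not drawn from any cataloging standard.
REQUIRED_FIELDS = {
    "book": {"title", "creator", "publisher"},
    "manuscript": {"title", "creator"},
}


def missing_fields(resource_type, record):
    """Return the required fields that are absent or empty in a record."""
    required = REQUIRED_FIELDS[resource_type]
    present = {name for name, value in record.items() if value}
    return sorted(required - present)


record = {"title": "Field notes", "creator": "A. Linguist"}
print(missing_fields("book", record))
print(missing_fields("manuscript", record))
```

The same record is incomplete when described as a book and complete when described as a manuscript, which is the sense of "fingerprint" intended here.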

AACR2 and RDA are both application profiles whose documentation is provided only on a subscription basis. This is a pay-to-play game, and that sort of game is not well received by the language documentation community. These facts do not mean that preservation organizations need to avoid MARC; rather, a MARC profile could be established and documented in the open.

When considering the future of OLAC and language resource archiving, an outstanding question emerges: is this sort of profile of interest within the community?

Posted in Other Journals | Tagged MARC, metadata, OLAC, RDA | Leave a reply

Dublin Core Subject field

Posted on October 31, 2022 by Hugh Paterson III

Dublin Core has a subject element. But what constitutes a subject?

Two points on this:

  1. Subjecthood is a complex notion. As Birger Hjørland points out, the concept can include both is-ness and about-ness. LIS theory may say to separate these concepts, but if Dublin Core as a descriptive framework does not allow this, then the notion of subjecthood should be assumed to include both.
  2. Pictures (still images, including paintings) are complex when evaluating their subjecthood. First, when a picture depicts something, it is reasonable to say that the picture is about that thing, as well as that the picture is something...

I am suggesting that Dublin Core as a standard does not distinguish between about-ness and is-ness with regard to subject. To complicate matters further, about-ness and is-ness merge more in visual media than in print-based media.

The following articles indirectly address the distinction of about-ness and is-ness or address about-ness in visual media.

Rushton, M. Public Funding of Controversial Art. Journal of Cultural Economics 24, 267–282 (2000). https://doi.org/10.1023/A:1007682121108

Wall, J. M. (2005). The Medium & the Message: Theology and Film. Theology Today, 62(1), 74–77. https://doi.org/10.1177/004057360506200109

Wanda Klenczon & Paweł Rygiel (2014) Librarian Cornered by Images, or How to Index Visual Resources, Cataloging & Classification Quarterly, 52:1, 42-61, DOI: 10.1080/01639374.2013.848123

in a book
Emerging Frameworks and Methods: CoLIS 4 : Proceedings of the Fourth

Andrea Witcomb (1997) On the Side of the Object: an Alternative Approach to Debates About Ideas, Objects and Museums, Museum Management and Curatorship, 16:4, 383-399, DOI: 10.1080/09647779700501604

Wang, X., Song, N., Liu, X. and Xu, L. (2021), "Data modeling and evaluation of deep semantic annotation for cultural heritage images", Journal of Documentation, Vol. 77 No. 4, pp. 906-925. https://doi.org/10.1108/JD-06-2020-0102

Posted in Other Journals | Tagged Dublin core, metadata, OLAC, UNT-notes | Leave a reply

OLAC and Library of Congress Demographic Group Terms

Posted on October 31, 2022 by Hugh Paterson III

Library of Congress Demographic Group Terms: https://id.loc.gov/authorities/demographicTerms.html

Posted in Other Journals | Tagged audience, metadata, OLAC | Leave a reply

OLAC and some genre terms

Posted on October 31, 2022 by Hugh Paterson III

I need to explore canonical equivalences between some genre terms and OLAC terms.

For example: https://www.loc.gov/standards/valuelist/marcgt.html and http://www.loc.gov/standards/sourcelist/genre-form.html

Posted in Other Journals | Tagged Genre terms, MARC, metadata, OLAC | Leave a reply

Raw questions when looking at the MODS documentation

Posted on October 30, 2022 by Hugh Paterson III

https://www.loc.gov/standards/mods/userguide/identifier.html

MODS documentation does not explain an expected syntax for the examples. This would be very helpful. What is the expected syntax for typeURI?

http://loc.gov/standards/mods/userguide/typeofresource.html
manuscript

Definition
A resource that is written in handwriting or typescript.
Application
This attribute is used as manuscript="yes" when a collection contains manuscripts and is considered generally to be manuscript in nature, and for individual manuscripts.

A collection is not the same thing as a DC collection, so what does the XSD say?

Where do the OLAC genre terms map to this list?

https://www.loc.gov/standards/valuelist/marcgt.html

Posted in Other Journals | Tagged metadata, MODS | Leave a reply

Schema.org

Posted on October 18, 2022 by Hugh Paterson III

Some links and papers on schema.org.

Breadcrumb

https://schema.org/WebSite

https://neilpatel.com/blog/get-started-using-schema/

https://developers.google.com/search/docs/appearance/structured-data/image-license-metadata

https://search.google.com/test/rich-results/result/r%2Fevents?id=_s8HGEDUyCAtztd6qexRyA

https://github.com/wowchemy/wowchemy-hugo-themes/blob/main/modules/wowchemy-seo/layouts/partials/jsonld/event.html

Posted in Marketing, Meta-data | Tagged metadata, schema.org | Leave a reply

OLAC and PREMIS

Posted on September 27, 2022 by Hugh Paterson III

During yesterday's presentation on PREMIS, which I liked very much, it seemed to me that a multi-author survey of archival institutions that conduct media transference and also participate in OLAC would make an interesting group project with publishable results.

Such a project might have three components:

  1. A “crosswalk style” mapping between PREMIS data values or fields and where those may appear in the OLAC profile. (How would one know if this information were available? Or, put another way, if an institution were to use this framework, how could it publish that metadata via the OLAC profile?)
  2. A review of current practice as evidenced by what we can see in OLAC records. (Are institutions contributing this metadata, and in what capacity?)
  3. A questionnaire or interview with specific institutions conducting digitization. Offhand I think the following do digitization: Alaska Native Language Archive, Kaipuleohone, SIL L&CA, PARADISEC, Oxford Text Archive, and the COllections de COrpus Oraux Numeriques (CoCoON ex-CRDO). (In what ways are institutions solving the problems that PREMIS helps to solve?)
https://www.loc.gov/standards/premis/
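
The second component, a review of current practice, could start from a simple tally of which elements archives actually use in their OLAC records. The sketch below is one assumed approach, not an established method: the two sample records are invented, and which elements (if any) would carry PREMIS-style information is exactly the open question, so the code only counts element names.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented records for illustration; real ones would be harvested from OLAC.
RECORDS = [
    """<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
      <dc:title>Recording A</dc:title>
      <dc:format>audio/wav</dc:format>
      <dcterms:provenance>Digitized from reel tape</dcterms:provenance>
    </olac:olac>""",
    """<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Recording B</dc:title>
      <dc:format>audio/mpeg</dc:format>
    </olac:olac>""",
]


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_usage(records_xml):
    """Count, per element name, how many records use it at least once."""
    usage = Counter()
    for xml_text in records_xml:
        root = ET.fromstring(xml_text)
        usage.update({local_name(child.tag) for child in root})
    return usage


for name, count in sorted(element_usage(RECORDS).items()):
    print(f"{name}: {count} of {len(RECORDS)} records")
```

Grouping the same tally by contributing archive would show which institutions publish this kind of information at all.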
Posted in Other Journals | Tagged metadata, OLAC | Leave a reply

Reflex Notes

Posted on June 21, 2021 by Hugh Paterson III

These are my sketch journal notes for reflex.

Posted in Other Journals | Tagged corpora, historical linguistics, metadata, reflex | Leave a reply

Data Anonymized?


I have found the following two links helpful when considering data anonymization and privacy issues in general.

http://www.caida.org/data/anonymization/

http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/

Posted on January 14, 2014 by Hugh Paterson III | Leave a reply

One should not consider the content on this website to be an official opinion of any company associated with me. These posts are solely my opinion.
