While doing my Masters Thesis, I took a look at the contributor roles declared for various works. One thing I noticed is that even though Stuart McGill contributed two corpora to ELAR when these corpora get translated to OALC the translation mucks the metadata so that only one resource shows up with his name.
Django application to collect submitted DOIs, acquire their API provided metadata (Bibliographic metadata and citation graph metadata), allow limited (specified) annotation, and then make those records harvestable via OAI-PMH. Language Resource tagger—Adding a layer of language related metadata to published resources.
Some Django modules for OAI-PMH
https://github.com/saw-leipzig/foaipmh
https://github.com/jnphilipp/django_oai_pmh
User Authentication
https://github.com/ubffm/django-orcid
https://django.fun/en/docs/social-docs/0.1/backends/orcid/
Database Versioning
This depends on how the DB is set up. If we only have one record per item or one record per state... This needs more definition.
https://djangopackages.org/grids/g/versioning/
https://www.wpbeginner.com/beginners-guide/complete-guide-to-wordpress-post-revisions/
Form Builders
https://djangopackages.org/grids/g/form-builder/
Some Javascript tools for creating the specific forms needed:
https://github.com/HughP/dublin-core-generator
https://nsteffel.github.io/dublin_core_generator/generator.html
Markdown for documentation
https://neutronx.github.io/django-markdownx/
Bibtex
https://bibtexparser.readthedocs.io/en/master/
https://github.com/sciunto-org/python-bibtexparser
https://github.com/jnphilipp/bibliothek
https://github.com/lucastheis/django-publications <-- also check the network as "improvements" are all over the place.
Other names include:
* Babybib
* Pybtex
* Pybibliographer
ISSNs
ISSN.org is supposed to have an API.. but not sure if they do.
https://portal.issn.org/resource/ISSN/1904-0008
Any request to the portal may be automated thanks to the use of REST protocol. The download of results is also automated. This service is restricted to subscribing users. Please contact sales [at] issn.org for more information.
https://portal.issn.org/node/170
https://portal.issn.org/resource/ISSN/2549-5089#
https://portal.issn.org/resource/ISSN/2549-5089?format=json
We could also slurp the HTML for the sameAs links to other DBs if needed.
Views:
1. login with ORCID
2. query APIs (DOIs, ISBNs, ISSNs, ORCID, WikiData, etc.)
3. results display and annotation
4. submission
5. List of past submissions
6. update past submission screen (same as #3?)
If we ran a module like this:
https://pybliometrics.readthedocs.io/en/latest/classes/SerialTitle.html
Then we could take a reading on where the least spoken languages appear in the most highly ranked journals and determine if there was a bias or a loss to science.
Data Examples:
Have been moved to:
https://github.com/HughP/CrossRef-to-OLAC-data-examples
PDF Extraction:
https://levelup.gitconnected.com/scrap-data-from-website-and-pdf-document-for-django-app-fa8f37010085
https://towardsdatascience.com/how-to-extract-pdf-data-in-python-876e3d0c288
https://stackoverflow.com/questions/71850349/download-a-pdf-from-url-edit-it-an-render-it-in-django
https://stackoverflow.com/questions/48882768/django-reading-pdf-files-content
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
PDF Creation:
https://docs.djangoproject.com/en/4.1/howto/outputting-pdf/
https://jeltef.github.io/PyLaTeX/current/examples/header.html
If abstract is a sample of about-ness, then a table of contents is sample if is-ness. Some have said that journal articles should not have table of contents (instructional staff at the UNT program teaching the Metadata I course). I disagree, but so does Habing, et al (2001). Sometimes more than an abstract a table of contents can deliver a substantial understanding of what an article is and is about by displaying its structure. In fact many law review articles actually include a table of contents prior to the main part of the article. Law review articles can be over 70 pages in length. An outline offers useful information to the potential reader.
An example of an outline from a linguistics article.
Roberts, David. 2011. “A Tone Orthography Typology.” Written Language & Literacy 14 (1): 82–108. doi:10.1075/wll.14.1.05rob.
Introduction
The six parameters
2.1 First parameter: Domain
2.2 Second parameter: Target
2.2.1 Tones
2.2.2 Grammar
2.2.3 Lexicon
2.2.4 Dual strategies
2.3 Third parameter: Symbol
2.3.1 Phonographic representations
2.3.2 Semiographic representations
2.4 Fourth parameter: Position
2.5 Fifth parameter: Density
2.5.1 Introduction
2.5.2 Zero density
2.5.3 Partial density
2.5.4 Exhaustive density
2.6 Sixth parameter: Depth
2.6.1 Introduction
2.6.2 Surface representation
2.6.3 Deep representation
2.6.4 Shallow (transparent) representation
Thomas G. Habing, Timothy W. Cole, and William H. Mischo. 2001. Qualified Dublin Core using RDF for Sci-Tech Journal Articles. https://dli.grainger.uiuc.edu/Publications/metadatacasestudy/HabingDC2001.pdf
Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
AACR2 and RDA both constitute application profiles using the same database structure known as MARC. MARC defines the fields and the expected values within those fields (type control) while AACR2 and RDA compose definitions of cognitive models and data fingerprinting (not the LIS terms for these concepts). By cognitive model I mean the mental representation of entities and their relationships and by fingerprinting of data I mean that some artifacts are "well described" when various fields are employed. E.g., a book description needs a publisher, while a manuscript does not.
AACR2 and RDA both constitute application profiles to which the documentation is only provided on a subscription basis. This is a pay-to-play game. This sort of game is not well received by the language documentation community. These facts do no mean that preservation organizations need to avoid MARC, rather a MARC profile could be established and documented in the open.
When considering the future of OLAC and language resource archiving an outstanding question emerges, is this sort of profile something that is of interest within the community?
Dublin Core has a subject element. But what constitutes a subject?
Two points on this:
Subject-hood is a complex notion. As pointed out by Birger Hjørland included in this concept can be both is-ness and about-ness. LIS theory can say to divide these concepts, but if Dublin Core as a descriptive framework does not allow this, then the notion of subjecthood should be assumed to include both notions.
Pictures (still images, including paintings) are complex when evaluating their subject hood. First, when a picture depicts something then it is reasonable to say that the picture is about that thing, as well as the picture is something...
I am suggesting that Dublin Core as a standard does not distinguish between about-ness and is-ness with regard to subject. And to further make matters complicated about-ness and is-ness merge more in visual media than in other types of print based media.
The following articles indirectly address the distinction of about-ness and is-ness or address about-ness in visual media.
Wanda Klenczon & Paweł Rygiel (2014) Librarian Cornered by Images, or How to Index Visual Resources, Cataloging & Classification Quarterly, 52:1, 42-61, DOI: 10.1080/01639374.2013.848123
in a book
Emerging Frameworks and Methods: CoLIS 4 : Proceedings of the Fourth
Andrea Witcomb (1997) On the Side of the Object: an Alternative Approach to Debates About Ideas, Objects and Museums, Museum Management and Curatorship, 16:4, 383-399, DOI: 10.1080/09647779700501604
Wang, X., Song, N., Liu, X. and Xu, L. (2021), "Data modeling and evaluation of deep semantic annotation for cultural heritage images", Journal of Documentation, Vol. 77 No. 4, pp. 906-925. https://doi.org/10.1108/JD-06-2020-0102
So, in recent OLAC presentation I talked about enabling Omeka or Drupal via recipes for OAI harvesting. Here is some links to internet chatter on these issues.