Building a discourse server

Posted on May 29, 2023 by Hugh Paterson III

pfaffman/discourse-doi-resolver
https://meta.discourse.org/t/sign-in-to-discourse-using-orcid/105488/4
https://meta.discourse.org/t/discourse-category-experts/190814
https://meta.discourse.org/t/custom-category-boxes/144865
https://meta.discourse.org/t/mentionables/192948 <-- content in OLAC
https://meta.discourse.org/t/admin-guide-to-tags-in-discourse/121041

Position conversations within the OLAC search space.

https://blog.discourse.org/2021/11/discourse-forum-seo/
https://meta.discourse.org/t/does-discourse-support-google-structured-data-i-e-schema-org/58249
https://meta.discourse.org/t/beginners-guide-to-seo-with-discourse/146655/4
https://meta.discourse.org/t/beginners-guide-to-seo-with-discourse/146655/7
https://meta.discourse.org/t/discourse-sitemap/40348

This might be a way forward to an OAI-PMH repo: https://github.com/discourse/discourse-sitemap another option is to use a query mechanism in the JSON api to get all threads and treat these threads as resources for description. https://meta.discourse.org/t/discourse-rest-api-documentation/22706

I wonder how many layers a tag-group can have... https://docs.discourse.org/#tag/Tags/operation/updateTagGroup

https://meta.discourse.org/t/locations-plugin/69742

Legal and privacy considerations:

https://meta.discourse.org/t/legal-tools-plugin/87966/26

Your Discourse forum and the GDPR

Import from other discourse instances:
https://meta.discourse.org/t/create-download-and-restore-a-backup-of-your-discourse-database/122710

Self-hosting, self-managed, hosted, serviced,

https://meta.discourse.org/t/comparing-hosting-providers/100034/13

https://www.literatecomputing.com/discourse-server-maintenance/
https://meta.discourse.org/t/recommended-hosting-providers-for-self-hosters/79562

Pricing:
https://discourse.org/pricing

https://github.com/discourse/discourse/blob/main/docs/INSTALL-cloud.md

Discourse Hosting Plans and Pricing

Dedicated email:
https://messagebird.com/pricing/email-sending

OCR solutions for python toolchains

Posted on May 4, 2023 by Hugh Paterson III

https://github.com/Calamari-OCR/calamari
https://builtin.com/data-science/python-ocr
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/#

Extract Text from Images Quickly Using Keras-OCR Pipeline

https://github.com/usnistgov/ocr-pipeline

App Build and Deploy

Posted on April 26, 2023 by Hugh Paterson III

https://fly.io/docs/languages-and-frameworks/python/

Metadata Interoperability at OLAC

Posted on April 12, 2023 by Hugh Paterson III

This week we had a lecture on metadata interoperability. Interoperability is a major theme of Gary Simons work on OLAC. It was the keyword or concept that he used to push the social behavior requirements related to the activities around, in, and at language archives.

I think that across the history of OLAC there have been various understandings on the kinds of metadata needed to describe language resources. That is, discovery is the architectural goal of OLAC, but other requirements also exist. In the beginning of OLAC many of the participants were looking at OLAC for a complete solution to the kinds of metadata they should be collecting and using. The other requirements upon resource stewards have always meant additional fields in diverse institutional contexts. The freedom to explore these other requirements has not always been explored or embraced by stewards. Some have seen OLAC as an all or nothing involvement. Maybe the fear has been that there will be divergence from a communal norm.

However, my perspective is that it is quite normal for each institution to have its own metadata schema or application profile some portion of which gets shared with OLAC.

With this as background then, with the assumption that different management practices will produce different metadata schemes it seems reasonable that each institution should update their schema from time to time. This implies that metadata quality in terms of coverage or "encoding" is a moving target. Another implication then, is that even in fields which are shared with the OLAC aggregator and are defined in the OLAC metadata application profile, that those fields may have different internal syntax at different providers or at different time depths of the records creation.

The ISO639-3 field is one evidence of evolutionary change. This standard has fields which split and merge from time to time. Associating a records time of creation with a version of an institutions metadata schema is a useful dynamic when evaluating a record's quality.

The question is how should a record and the version of its applicable metadata profile be associated in the OLAC context? How should this information be communicated to record viewers?

The answer is rather straightforward, but requires two parts. The first part requires a modification to the archive profile to have two information bits:

The name of the native application profile at the data provider
A link to the native metadata application profile documentation

The documentation should be in a publicly accessible place so that the provided metadata makes sense. There are several ways this could be accomplished one way is to create a manifestation record for each iteration of the application profile. These could be related into a collection or they could have a single relation.

which in the listSet

The OLAC OAI record should have in its source in the first harvest the name and version of the native metadata schema used for the generation of the record. The link to the native version of the providers metadata schema's documentation should be provided in the archive section of the OAI describer.

Some utilities in OAI can modify data, some can be servers only, some havesters only, some harvesters and servers.

Some OAI providers are

Using record sets:

OLAC could allow end-users to dynamically create sets of records for export using the setSpec part of OAI. Playing with this and audience interest might create some social interest.

Fuzzy matching

Posted on November 19, 2022 by Hugh Paterson III

https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe

https://www.activestate.com/blog/how-to-implement-fuzzy-matching-in-python/#:~:text=As%20mentioned%20above%2C%20fuzzy%20matching,strings%20are%20to%20one%20another.

Topic Modeling in Python

Posted on November 19, 2022 by Hugh Paterson III

What if my import pd data array was the OLAC metadata schema?

https://www.semanticscholar.org/paper/A-Gentle-Introduction-to-Topic-Modeling-Using-Saxton/38742c56eadfdf11fb7218f7702c8fccfc78bd95

https://gist.github.com/umbertogriffo/5041b9e4ec6c3478cef99b8653530032

https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576

Beginners Guide to Topic Modeling in Python

How to Use Bertopic for Topic Modeling and Content Analysis?

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/08-Topic-Modeling-Text-Files.html

https://asandeepc.bitbucket.io/courses/inls613_summer2019/lectures/08-lda_topic_modeling.pdf

http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf

https://ourcodingclub.github.io/tutorials/topic-modelling-python/

https://stackabuse.com/python-for-nlp-topic-modeling/

https://www.toptal.com/python/topic-modeling-python

https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

Django modules and links

Posted on November 8, 2022 by Hugh Paterson III

Django application to collect submitted DOIs, acquire their API provided metadata (Bibliographic metadata and citation graph metadata), allow limited (specified) annotation, and then make those records harvestable via OAI-PMH. Language Resource tagger—Adding a layer of language related metadata to published resources.

Some Django modules for OAI-PMH
https://github.com/saw-leipzig/foaipmh
https://github.com/jnphilipp/django_oai_pmh

https://pypi.org/user/jnphilipp/ his topic extraction module looks interesting.

Also look at the xsd schema here https://github.com/saw-leipzig/foaipmh/blob/5b15d5cc4700a3cccf497c47218c2fba6b3421d5/entrypoint.prod.sh#L5

Metadata utility for OAI-PMH

https://combine.readthedocs.io/en/master/configuration.html

User Authentication
https://github.com/ubffm/django-orcid
https://django.fun/en/docs/social-docs/0.1/backends/orcid/

Crossref
https://github.com/fabiobatalha/crossrefapi

Introducing Crossref, the basics

Database Versioning
This depends on how the DB is set up. If we only have one record per item or one record per state... This needs more definition.
https://djangopackages.org/grids/g/versioning/
https://www.wpbeginner.com/beginners-guide/complete-guide-to-wordpress-post-revisions/

Form Builders
https://djangopackages.org/grids/g/form-builder/

Some Javascript tools for creating the specific forms needed:
https://github.com/HughP/dublin-core-generator
https://nsteffel.github.io/dublin_core_generator/generator.html

Markdown for documentation
https://neutronx.github.io/django-markdownx/

Bibtex
https://bibtexparser.readthedocs.io/en/master/
https://github.com/sciunto-org/python-bibtexparser
https://github.com/jnphilipp/bibliothek
https://github.com/lucastheis/django-publications <-- also check the network as "improvements" are all over the place.
Other names include:
* Babybib
* Pybtex
* Pybibliographer

APIs

ORCID
https://github.com/ORCID/python-orcid

API Tutorial: Searching the ORCID registry

Crossref API doc
https://github.com/CrossRef/rest-api-doc/blob/master/demos/crossref-api-demo.ipynb
Crossref types: https://www.crossref.org/documentation/register-maintain-records/
https://api.crossref.org/swagger-ui/index.html#/Types/get_types__id__works

Others — Mostly citation and references
http://www.scholix.org/
https://scholexplorer.openaire.eu/#/query/page=5/q=language
https://crossref.gitlab.io/knowledge_base/products/event-data/
FatCat https://fatcat.wiki/
InternetArchive Scholar https://scholar.archive.org/
Thor project https://project-thor.readme.io/docs/introduction-for-integrators
Corsscite.org
Semantic Scholar API https://api.semanticscholar.org/api-docs/graph
https://core.ac.uk/
https://opencitations.net/
https://unpaywall.org/ --> see: http://musingsaboutlibrarianship.blogspot.com/2017/11/using-oadoi-crossref-event-data-api-to.html
https://openalex.org/
https://arxiv.org/help/api/index
https://www.aminer.org/citation
https://www.aminer.org/download
https://open.aminer.cn/
https://analytics.hathitrust.org/datasets#top
https://pro.dp.la/developers/api-codex
https://pro.europeana.eu/page/apis

LCSH
https://github.com/edsu/id

MARC
For generating an ingesting MARC records
https://pymarc.readthedocs.io/en/latest/

Zotero
https://github.com/urschrei/pyzotero

Overview see: https://researchguides.smu.edu.sg/api-list/scholarly-metadata-api

ISSNs
ISSN.org is supposed to have an API.. but not sure if they do.
https://portal.issn.org/resource/ISSN/1904-0008
Any request to the portal may be automated thanks to the use of REST protocol. The download of results is also automated. This service is restricted to subscribing users. Please contact sales [at] issn.org for more information.
https://portal.issn.org/node/170

https://portal.issn.org/resource/ISSN/2549-5089#
https://portal.issn.org/resource/ISSN/2549-5089?format=json
We could also slurp the HTML for the sameAs links to other DBs if needed.

JATS
https://pypi.org/project/jatsgenerator/
https://stackoverflow.com/questions/42084165/extracting-text-from-jats-xml-file-using-python
https://github.com/sibils/jats-parser

Pandas
https://pypi.org/project/django-pandas/

Beautiful Soup

There is the issue of how do we add to a Dublin Core OAI record how it was changed over time.... I need to architect this out.

Record Provenance:
[]Explore
https://www.w3.org/TR/prov-dc/
https://www.w3.org/2011/prov/track/issues/607?changelog
http://www.ukoln.ac.uk/metadata/dcmi/collection-provenance/
https://edoc.hu-berlin.de/bitstream/handle/18452/2727/332.pdf?sequence=1&isAllowed=y
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4177195/
https://www.loc.gov/standards/mods/userguide/recordinfo.html
https://tsl.access.preservica.com/tslac-digital-preservation-framework/qualified-dublin-core-schema/
https://dl.acm.org/doi/10.5555/2770897.2770924
https://blog.datacite.org/exposing-doi-metadata-provenance/
https://dgarijo.com/papers/dc2011.pdf
https://ceur-ws.org/Vol-670/paper_3.pdf
https://ecommons.cornell.edu/bitstream/handle/1813/55327/Encoding%20Provenance%20for%20Social%20Science%20Data-final.pdf?sequence=3&isAllowed=y

Views:
1. login with ORCID
2. query APIs (DOIs, ISBNs, ISSNs, ORCID, WikiData, etc.)
3. results display and annotation
4. submission
5. List of past submissions
6. update past submission screen (same as #3?)

If we ran a module like this:
https://pybliometrics.readthedocs.io/en/latest/classes/SerialTitle.html

Then we could take a reading on where the least spoken languages appear in the most highly ranked journals and determine if there was a bias or a loss to science.

Data Examples:

Have been moved to:
https://github.com/HughP/CrossRef-to-OLAC-data-examples

PDF Extraction:
https://levelup.gitconnected.com/scrap-data-from-website-and-pdf-document-for-django-app-fa8f37010085
https://towardsdatascience.com/how-to-extract-pdf-data-in-python-876e3d0c288
https://stackoverflow.com/questions/71850349/download-a-pdf-from-url-edit-it-an-render-it-in-django
https://stackoverflow.com/questions/48882768/django-reading-pdf-files-content
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/

PDF Creation:
https://docs.djangoproject.com/en/4.1/howto/outputting-pdf/
https://jeltef.github.io/PyLaTeX/current/examples/header.html

NER:

https://johnfraney.github.io/django-ner-trainer/settings/

Named Entity Recognition (NER) in Python with Spacy

Other:
https://prodi.gy/

https://realpython.com/testing-in-django-part-1-best-practices-and-examples/

here is a django app for controlling URIs for linked data vocabularies.
https://github.com/unt-libraries/django-controlled-vocabularies
as seen here https://digital2.library.unt.edu/vocabularies/agent-qualifiers/

And here is a one for source authority records.
https://github.com/unt-libraries/django-name
as seen here: https://digital2.library.unt.edu/name/nm0000001/

Link Checker
https://github.com/Kaltsoon/dead-link-checker
https://pypi.org/project/django-linkcheck/
https://github.com/bartdag/pylinkvalidator
https://stackoverflow.com/questions/43264291/in-django-how-can-i-unit-test-all-links-recursively-every-view-check-for-200-o

Scraping archives for OLAC

Posted on November 2, 2022 by Hugh Paterson III

This post is a set of resources I am compiling to create a scrape of a language archive to create a Static OLAC feed.

https://www.youtube.com/watch?v=RvCBzhhydNk

https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html

The archive: http://roa.rutgers.edu/article/browse

https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/

https://www.edureka.co/blog/web-scraping-with-python/

https://www.webscrapingapi.com/python-web-scraping

Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/

NER

Named Entity Recognition (NER) in Python with Spacy

File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions

XML parsing

https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm

Platform tools for OAI harvesting

Posted on October 17, 2022 by Hugh Paterson III

So, in recent OLAC presentation I talked about enabling Omeka or Drupal via recipes for OAI harvesting. Here is some links to internet chatter on these issues.

Koha

enable Items in KOHA OAI Harvesting

https://koha-community.org/manual/18.05/en/html/webservices.html

https://forums.zotero.org/discussion/38956/export-of-zotero-citation-to-marc-format-for-import-into-koha-lms

wordpress

Day 13: Harvest data with OAI-PMH

WordPress and Drupal

https://acrl.ala.org/techconnect/post/creating-an-oai-pmh-feed-from-your-website/

eHive

https://developers.ehive.com/

Catmandu

https://librecatproject.wordpress.com/tutorial/

Omeka
https://omeka.org/classic/docs/Plugins/OaiPmhRepository/

MOAI
https://pypi.org/project/MOAI/

PyOAI
https://pypi.org/project/pyoai/

Posted in Other Journals | Tagged OAI, OAI-PMH, OLAC, PHP, Python, R-90 | Leave a reply

Post navigation

Newer posts →

Activity
August 2025

M T W T F S S

1 2 3

4 5 6 7 8 9 10

11 12 13 14 15 16 17

18 19 20 21 22 23 24

25 26 27 28 29 30 31

« Jan

I’ve been saying

Chasing subsets

New mouse buttons

Moving Apple notes

Academic Heritage in MARC records

Converting DC Subjects to Schema.org

Language Documentation Gear

Serials, MARC Records and RDA Core

Font Modulator

OLAC CMS options via XML

OLAC Collection Description and Linked Data Terms

Zotero Plugins

OLAC and User Tasks

Say What?
David Clews on German Waters
Jeff Pitts on Kinder Eier
Jeff on Plasticification of soil
Thoughts on file formats and file names in language documentation projects and archiving | The Journeyler on The Workflow Management for Linguists
Hugh Paterson III on Types of Linguistic Maps: The Mapping of linguistic Features and Researcher Interactivity

One should not consider the content on this website to be an official opinion of any company associated with me. These posts are solely my opinion.

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Tag Archives: R-90