https://github.com/johnfraney/django-ner-trainer
https://johnfraney.github.io/django-ner-trainer/
https://github.com/doccano/doccano
https://prodi.gy/buy
https://github.com/GitTeaching/Django_NER_Crisis?tab=readme-ov-file
https://realpython.com/python-mysql/
https://dev.mysql.com/doc/connector-python/en/preface.html
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://qxf2.com/blog/extracting-data-from-pdfs-python/
https://www.metachris.com/pdfx/
https://softwarerecs.stackexchange.com/questions/76210/software-to-extract-the-list-of-references-and-title-from-a-pdf-of-a-research-pa
https://pypi.org/project/refextract/
https://stackoverflow.com/questions/62365767/extract-references-from-pdf-python
https://discuss.python.org/t/pdf-extraction-with-python-wrappers/40384
https://github.com/Calamari-OCR/calamari
https://builtin.com/data-science/python-ocr
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/#
https://github.com/usnistgov/ocr-pipeline
https://fly.io/docs/languages-and-frameworks/python/
What do I want to do?
In the context of a pipeline, I want to test a font against a known version of an orthography to find out whether the font supports that orthography.
To that end, here is a start:
https://stackoverflow.com/questions/4458696/finding-out-what-characters-a-given-font-supports
Another interesting site is https://fontdrop.info/. It provides information from the font's metadata and lists the languages the font supports. But I wonder what "language support" or "supported languages" means in these contexts. Where does the list of languages come from? Where are these languages' requirements cataloged?
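One way to sketch the font test is as a set comparison: collect the code points the orthography needs, collect the code points the font maps to glyphs, and report the difference. The function names below are my own. Reading the font assumes the third-party fontTools package is installed, and the orthography's character list is assumed to come from somewhere else in the pipeline.

```python
def missing_characters(orthography, supported_codepoints):
    """Return the characters in the orthography that the font does not map.

    `orthography` is an iterable of strings (single characters or
    multi-character units such as a letter plus a combining mark).
    `supported_codepoints` is a set of integer code points.
    """
    needed = {ch for item in orthography for ch in item}
    return sorted(ch for ch in needed if ord(ch) not in supported_codepoints)


def load_supported_codepoints(font_path):
    """Read the code points a font file maps to glyphs.

    Assumption: fontTools is installed (pip install fonttools).
    """
    from fontTools.ttLib import TTFont  # imported lazily; third-party

    font = TTFont(font_path)
    cmap = font["cmap"].getBestCmap() or {}
    return set(cmap)


if __name__ == "__main__":
    # Illustrative only: a pretend font that covers basic Latin lowercase.
    pretend_font = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}
    print(missing_characters(["a", "ŋ", "b"], pretend_font))
```

A character map check like this is necessary but not sufficient. A font can contain every code point and still stack diacritics badly, so this only answers the first part of the "does it support the orthography" question.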
https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
https://www.activestate.com/blog/how-to-implement-fuzzy-matching-in-python/#:~:text=As%20mentioned%20above%2C%20fuzzy%20matching,strings%20are%20to%20one%20another.
What if the data array I load with pandas (pd) were structured by the OLAC metadata schema?
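A minimal sketch of how fuzzy matching could connect messy column labels to a metadata schema. It uses difflib from the standard library rather than the packages in the posts above, and it matches against the fifteen Dublin Core element names that OLAC metadata builds on. The function name and the 0.6 cutoff are my own assumptions.

```python
import difflib

# The fifteen Dublin Core elements.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def guess_element(label, cutoff=0.6):
    """Return the closest Dublin Core element name for a label, or None."""
    matches = difflib.get_close_matches(
        label.strip().lower(), DC_ELEMENTS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for label in ["Langauge", "Creater", "xyzzy"]:
        print(label, "->", guess_element(label))
```

The same function could be run over the column labels of an imported table to see which ones already look like schema elements and which need a manual mapping.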
https://www.semanticscholar.org/paper/A-Gentle-Introduction-to-Topic-Modeling-Using-Saxton/38742c56eadfdf11fb7218f7702c8fccfc78bd95
https://gist.github.com/umbertogriffo/5041b9e4ec6c3478cef99b8653530032
https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576
How to Use Bertopic for Topic Modeling and Content Analysis?
https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/08-Topic-Modeling-Text-Files.html
https://asandeepc.bitbucket.io/courses/inls613_summer2019/lectures/08-lda_topic_modeling.pdf
http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
https://ourcodingclub.github.io/tutorials/topic-modelling-python/
https://stackabuse.com/python-for-nlp-topic-modeling/
https://www.toptal.com/python/topic-modeling-python
https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
This post is a set of resources I am compiling in order to scrape a language archive and build a static OLAC feed from it.
https://www.youtube.com/watch?v=RvCBzhhydNk
https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html
The archive: http://roa.rutgers.edu/article/browse
https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/
https://www.edureka.co/blog/web-scraping-with-python/
https://www.webscrapingapi.com/python-web-scraping
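The first step of a scrape like this is usually collecting the item links from a browse page. Here is a minimal sketch using only the standard library. The sample HTML and the example.org URLs are made up for illustration; I have not checked them against the markup of the real archive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical browse-page fragment, not the real archive's markup.
SAMPLE_HTML = """
<ul>
  <li><a href="/item/1">First paper</a></li>
  <li><a href="/item/2">Second paper</a></li>
</ul>
"""

if __name__ == "__main__":
    for url, title in collect_links(SAMPLE_HTML, "http://example.org/browse"):
        print(url, title)
```

Each collected URL would then be fetched and turned into one record in the feed.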
Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/
NER
File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
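For files without extensions, the basic idea behind these tools is to look at the first few bytes of the file. The sketch below is a small standard-library stand-in, not python-magic itself, which wraps libmagic and knows far more formats. The signature table and function names are my own.

```python
# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),  # also the container for docx, epub, etc.
    (b"<?xml", "xml"),
]


def sniff_bytes(head):
    """Guess a format label from the first bytes of a file."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return "unknown"


def sniff_file(path, size=16):
    """Read the start of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(size))


if __name__ == "__main__":
    print(sniff_bytes(b"%PDF-1.7\n"))
    print(sniff_bytes(b"plain text"))
```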
XML parsing
https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
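A minimal sketch of pulling Dublin Core fields out of a metadata record. It uses xml.etree.ElementTree from the standard library so it runs anywhere. For XML from untrusted sources, defusedxml.ElementTree.fromstring is meant as a hardened replacement for the same call. The sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A sample title</dc:title>
  <dc:language>en</dc:language>
</record>"""


def dc_fields(xml_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    prefix = "{" + DC_NS + "}"
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


if __name__ == "__main__":
    print(dc_fields(SAMPLE))
```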
So, in a recent OLAC presentation I talked about enabling Omeka or Drupal for OAI harvesting via recipes. Here are some links to internet chatter on these issues.
Koha
https://koha-community.org/manual/18.05/en/html/webservices.html
WordPress
WordPress and Drupal
https://acrl.ala.org/techconnect/post/creating-an-oai-pmh-feed-from-your-website/
eHive
Catmandu
https://librecatproject.wordpress.com/tutorial/
Omeka
https://omeka.org/classic/docs/Plugins/OaiPmhRepository/
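Whichever platform ends up exposing the feed, a harvester talks to it with the same kind of OAI-PMH request: a base URL plus a verb and its arguments. Here is a minimal sketch of building those request URLs. The base URL is a placeholder, and the function name is my own.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a base URL, a verb, and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "http://example.org/oai"  # placeholder, not a real endpoint
    print(oai_request_url(base, "Identify"))
    # oai_dc is the metadata format every OAI-PMH repository must offer;
    # an OLAC data provider would also answer to metadataPrefix=olac.
    print(oai_request_url(base, "ListRecords", metadataPrefix="oai_dc"))
```

Pasting such a URL into a browser is a quick way to check whether a plugin is actually serving records.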
This summer I am sitting in on a computational linguistics course. It is the first instruction I have had about UNIX. Pretty Awesome.
This has required me to do some googling for terminal commands.
This is kind of a sketch of where I have been.
UNIX:
http://www.osxfaq.com/Tutorials/LearningCenter/
SSH:
http://kimmo.suominen.com/docs/ssh/
http://ss64.com/osx/
TERMINAL:
http://homepage.mac.com/rgriff/files/TerminalBasics.pdf
grep:
http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/
http://en.wikipedia.org/wiki/Grep
http://www.computerhope.com/unix/ugrep.htm
Regular Expressions:
http://www.zytrax.com/tech/web/regex.htm
http://www.regular-expressions.info/tutorial.html
http://gnosis.cx/publish/programming/regular_expressions.html
RegEx and Unicode:
One of the issues that I have had with RegEx is the question of what counts as a natural class. The usual examples are ranges such as [A-Z], [A-Za-z], [0-9], etc. As a linguist I deal a lot with IPA characters, subscripts, superscripts, Unicode, and diacritics. How am I to define a natural class with these? Can I define a natural class based on the phonology of the language?
So I did some more searching:
http://unicode.org/reports/tr18/
http://unicode.org/reports/tr18/tr18-5.1.html
http://icu-project.org/docs/papers/iuc26_regexp.pdf
http://courses.ischool.berkeley.edu/i256/f06/papers/regexps_tutorial.pdf
http://wapedia.mobi/en/Regular_expression?t=5.
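One possible answer to the natural class question is to skip character ranges altogether and build the class from an explicit list of segments. The sketch below does that in Python (not Perl) with the standard re module. The class names and segment inventories are invented for illustration and do not describe any real language. Everything is normalized to NFD first, so a precomposed vowel and a vowel plus combining tilde are treated alike.

```python
import re
import unicodedata

# Hypothetical natural classes; the inventories are illustrative only.
NATURAL_CLASSES = {
    "nasal": ["m", "n", "ŋ", "ɲ"],
    "aspirated_stop": ["pʰ", "tʰ", "kʰ"],
    "nasal_vowel": ["ã", "ẽ", "õ"],
}


def class_pattern(segments):
    """Return a regex fragment that matches any one segment of the class."""
    normalized = {unicodedata.normalize("NFD", s) for s in segments}
    # Longest segments first, so multi-character segments win over prefixes.
    ordered = sorted(normalized, key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"


def find_class(name, text):
    """Find every occurrence of a natural class in a text."""
    pattern = re.compile(class_pattern(NATURAL_CLASSES[name]))
    return pattern.findall(unicodedata.normalize("NFD", text))


if __name__ == "__main__":
    print(find_class("nasal", "mãŋa"))
    print(find_class("aspirated_stop", "pʰatʰa"))
```

The fragments returned by class_pattern can be concatenated into larger patterns, for example an aspirated stop followed by a nasal vowel. Because matches come back in NFD, they may need normalizing again before being compared with other text.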
RegEx+PERL+Unicode:
http://perldoc.perl.org/perlretut.html
PERL:
http://www.enginsite.com/Library-Perl-Regular-Expressions-Tutorial.htm
http://www.cgi101.com/book/connect/mac.html
http://www.mactech.com/articles/mactech/Vol.18/18.09/PerlforMacOSX/index.html
© 2005-2024 Hugh Paterson III All Rights Reserved.