https://github.com/Calamari-OCR/calamari
https://builtin.com/data-science/python-ocr
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/#
https://github.com/usnistgov/ocr-pipeline
https://fly.io/docs/languages-and-frameworks/python/
What do I want to do?
In the context of a pipeline, I want to test a font against a known version of an orthography to find out whether the font supports that orthography.
To that end, here is a start:
https://stackoverflow.com/questions/4458696/finding-out-what-characters-a-given-font-supports
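One way to sketch that test in Python is shown here. It assumes the font's supported code points are already in hand as a set of integers (for example from the `getBestCmap()` method of a fontTools `TTFont`, which maps code points to glyph names), and that the orthography is a list of characters or character sequences. The function name and the sample data are hypothetical. A unit counts as supported if either its composed (NFC) or decomposed (NFD) form is fully covered.

```python
import unicodedata


def unsupported_units(orthography, font_codepoints):
    """Return the orthography units that the font cannot encode.

    orthography: iterable of strings, each a character or a sequence
        such as a base letter plus a combining mark.
    font_codepoints: set of integer code points the font maps to a glyph.
    """
    unsupported = []
    for unit in orthography:
        # A font may carry only the precomposed glyph, or only the base
        # letter and the combining mark, so try both normalization forms.
        forms = {unicodedata.normalize(form, unit) for form in ("NFC", "NFD")}
        covered = any(
            all(ord(ch) in font_codepoints for ch in form) for form in forms
        )
        if not covered:
            unsupported.append(unit)
    return unsupported


# Hypothetical font: basic Latin lowercase plus the combining acute accent.
SAMPLE_FONT = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"} | {0x0301}

# Hypothetical orthography: a, e-acute, eng, open-e with acute.
SAMPLE_ORTHOGRAPHY = ["a", "\u00e9", "\u014b", "\u025b\u0301"]

print(unsupported_units(SAMPLE_ORTHOGRAPHY, SAMPLE_FONT))
```

This only checks that the code points are mapped. It says nothing about whether the font positions combining marks well, which would need a look at the font's layout tables.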
Another interesting site is https://fontdrop.info/. It reads information from a font's metadata and lists the languages the font supports. But I wonder, what does "language support" or "supported languages" mean in these contexts? Where does the list of languages come from? Where are these languages' requirements cataloged?
https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
What if the data I load with pandas (import pandas as pd) followed the OLAC metadata schema?
https://www.semanticscholar.org/paper/A-Gentle-Introduction-to-Topic-Modeling-Using-Saxton/38742c56eadfdf11fb7218f7702c8fccfc78bd95
https://gist.github.com/umbertogriffo/5041b9e4ec6c3478cef99b8653530032
https://towardsdatascience.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576
https://www.holisticseo.digital/python-seo/topic-modeling/
https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/08-Topic-Modeling-Text-Files.html
https://asandeepc.bitbucket.io/courses/inls613_summer2019/lectures/08-lda_topic_modeling.pdf
http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
https://ourcodingclub.github.io/tutorials/topic-modelling-python/
https://stackabuse.com/python-for-nlp-topic-modeling/
https://www.toptal.com/python/topic-modeling-python
https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
This post is a set of resources I am compiling in order to scrape a language archive and build a static OLAC feed from it.
https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html
The archive: http://roa.rutgers.edu/article/browse
https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/
https://www.edureka.co/blog/web-scraping-with-python/
https://www.webscrapingapi.com/python-web-scraping
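As a first step toward that scrape, here is a minimal sketch that pulls the links out of a browse page using only the Python standard library. The HTML snippet and the `/article/view/...` paths are made up for illustration; I have not checked them against the real archive's markup. `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up markup standing in for a downloaded browse page.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/view/1">First paper</a></li>
  <li><a href="/article/view/2">Second paper</a></li>
  <li><a name="top">No link here</a></li>
</ul>
"""

for link in extract_links(SAMPLE_HTML, "http://roa.rutgers.edu/article/browse"):
    print(link)
```

In a real run the HTML would come from something like `urllib.request.urlopen(url).read().decode()`, and each collected link would become one record in the feed.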
Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/
NER
NLP Application: Named Entity Recognition (NER) in Python with Spacy
File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
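The idea behind these tools is to look at the first few bytes of a file instead of trusting its extension. Below is a rough standard-library sketch of that idea with a handful of well-known signatures. The table is far from complete, and the function names are my own. python-magic does the same job much more thoroughly by calling libmagic.

```python
# A few well-known file signatures ("magic numbers") and a label for each.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also the container for docx, odt, epub
]


def sniff_bytes(header):
    """Guess a file type from the leading bytes of its content."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def sniff_file(path, size=16):
    """Read the first few bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(size))


print(sniff_bytes(b"%PDF-1.7\n"))
```

For downloaded archive items this helps sort PDFs from images and zipped documents before deciding whether text extraction or OCR is the next step.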
XML parsing
https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
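For reference, here is a small sketch of reading Dublin Core elements out of a record with the standard library's ElementTree. The record is invented for the example and `read_record` is a hypothetical helper. The namespace URI is the usual Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used when searching for namespaced elements.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Invented record, just enough to show namespaced lookups.
SAMPLE_RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A sample title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
</record>"""


def read_record(xml_text):
    """Return the title and the list of creators found in one record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("dc:title", default="", namespaces=DC_NS),
        "creators": [el.text for el in root.findall("dc:creator", DC_NS)],
    }


print(read_record(SAMPLE_RECORD))
```

When the XML comes from a source I do not control, `defusedxml.ElementTree.fromstring` can stand in for `ET.fromstring`, which is the reason defusedxml is on this list.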
This summer I am sitting in on a computational linguistics course. It is the first instruction I have had about UNIX. Pretty Awesome.
This has required me to do some googling for terminal commands.
This is kind of a sketch of where I have been.
UNIX:
http://www.osxfaq.com/Tutorials/LearningCenter/
SSH:
http://kimmo.suominen.com/docs/ssh/
http://ss64.com/osx/
TERMINAL:
http://homepage.mac.com/rgriff/files/TerminalBasics.pdf
grep:
http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/
http://en.wikipedia.org/wiki/Grep
http://www.computerhope.com/unix/ugrep.htm
Regular Expressions:
http://www.zytrax.com/tech/web/regex.htm
http://www.regular-expressions.info/tutorial.html
http://gnosis.cx/publish/programming/regular_expressions.html
RegEx and Unicode:
One of the issues I have had with RegEx is how to define a natural class. The built-in ranges are things like [A-Z], [A-Za-z], and [0-9]. As a linguist I deal a lot with IPA characters, subscripts, superscripts, Unicode, and diacritics. How am I to define a natural class with these? Can I define a natural class based on the phonology of the language?
So I did some more searching:
http://unicode.org/reports/tr18/
http://unicode.org/reports/tr18/tr18-5.1.html
http://icu-project.org/docs/papers/iuc26_regexp.pdf
http://courses.ischool.berkeley.edu/i256/f06/papers/regexps_tutorial.pdf
http://wapedia.mobi/en/Regular_expression?t=5.
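One possible answer to the natural class question, sketched in Python: list the segments of the class yourself and build the pattern from that list. The segment inventories below are invented, and `natural_class` is a hypothetical helper. Everything is normalized to NFD first so that a vowel with a diacritic is always a base letter plus a combining mark, in both the pattern and the text. Python's built-in `re` module does not support `\p{...}` property classes, so an explicit list is the simple route there.

```python
import re
import unicodedata


def natural_class(segments):
    """Build a regex fragment that matches any one of the given segments."""
    normalized = {unicodedata.normalize("NFD", seg) for seg in segments}
    # Longest segments first, so a letter plus diacritic is tried before
    # the bare letter.
    ordered = sorted(normalized, key=lambda seg: (-len(seg), seg))
    return "(?:" + "|".join(re.escape(seg) for seg in ordered) + ")"


# Invented inventories for illustration.
VOWELS = natural_class(["a", "e", "i", "o", "u", "\u025b", "\u0254", "a\u0303"])
NASALS = natural_class(["m", "n", "\u014b", "\u0272"])


def find_vowel_nasal(text):
    """Return every vowel + nasal sequence found in the text."""
    text = unicodedata.normalize("NFD", text)
    return re.findall(VOWELS + NASALS, text)


print(find_vowel_nasal("ka\u014ba t\u025bn ma\u0303"))
```

Because the classes are plain lists, they can follow the phonology of a particular language instead of whatever ranges the regex engine happens to provide.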
RegEx+PERL+Unicode:
http://perldoc.perl.org/perlretut.html
PERL:
http://www.enginsite.com/Library-Perl-Regular-Expressions-Tutorial.htm
http://www.cgi101.com/book/connect/mac.html
http://www.mactech.com/articles/mactech/Vol.18/18.09/PerlforMacOSX/index.html