This post collects resources I am compiling for scraping a language archive in order to build a static OLAC feed.
https://www.youtube.com/watch?v=RvCBzhhydNk
https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html
The archive: http://roa.rutgers.edu/article/browse
https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/
https://www.edureka.co/blog/web-scraping-with-python/
https://www.webscrapingapi.com/python-web-scraping
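As a note to self, the first step in most of the tutorials above is pulling the item links out of a browse page. Here is a minimal sketch of that step using only the standard library. The sample HTML, the example.org URL, and the names LinkCollector and extract_links are made up for illustration; the real archive's markup will differ and needs to be inspected first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical browse-page fragment, not the real archive markup.
SAMPLE_HTML = """
<ul>
  <li><a href="/archive/view/1">First paper</a></li>
  <li><a href="view/2">Second paper</a></li>
  <li><a name="no-href">Not a link</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML, "http://example.org/archive/browse")
for link in links:
    print(link)
```

For a live scrape the HTML string would come from an HTTP request instead of a constant, with a polite delay between requests.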
Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/
NER
File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
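The idea behind these links is that a downloaded file's type can be guessed from its first few bytes rather than from its extension. A rough sketch of that idea, assuming a small hand-picked signature table (the table and the sniff_type name are my own; python-magic's `magic.from_file(path, mime=True)` does the same job far more thoroughly):

```python
# A few well-known file signatures ("magic numbers").
# Note: ZIP also covers .docx, .odt, and other zipped formats.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"{\\rtf", "application/rtf"),
]


def sniff_type(data):
    """Guess a MIME type from the leading bytes of a file."""
    for magic_bytes, mime in SIGNATURES:
        if data.startswith(magic_bytes):
            return mime
    return "application/octet-stream"  # unknown


def sniff_file(path):
    """Read only the first 16 bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))


print(sniff_type(b"%PDF-1.4\n"))
```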
XML parsing
https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
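A small sketch of reading Dublin Core fields out of an XML record, which is roughly what checking a generated feed would involve. It uses defusedxml when it is installed and falls back to the standard library parser otherwise. The `record` wrapper element and the sample values are invented for illustration and are not the actual OLAC schema; only the Dublin Core namespace URI is real.

```python
try:
    # Safer parser for untrusted XML, if available.
    from defusedxml.ElementTree import fromstring
except ImportError:
    from xml.etree.ElementTree import fromstring

# Dublin Core elements namespace, in ElementTree's {uri}tag notation.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record; the wrapper element is an assumption.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample paper</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
</record>"""


def read_record(xml_text):
    """Return the title and list of creators from one record."""
    root = fromstring(xml_text)
    return {
        "title": root.findtext(DC + "title"),
        "creators": [el.text for el in root.iter(DC + "creator")],
    }


record = read_record(SAMPLE)
print(record)
```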