bs4 | The Journeyler

This post collects resources I am compiling for scraping a language archive in order to build a static OLAC feed.

https://www.youtube.com/watch?v=RvCBzhhydNk

https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html

The archive: http://roa.rutgers.edu/article/browse

https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/

https://www.edureka.co/blog/web-scraping-with-python/

https://www.webscrapingapi.com/python-web-scraping
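
As a rough sketch of the first scraping step, the snippet below pulls article links out of a listing page. It uses only the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup; with bs4 the same step would be roughly `BeautifulSoup(html, "html.parser").find_all("a", href=True)`. The sample markup and the `/article/view/N` paths are assumptions made for illustration, since the real structure of the browse page has not been checked here, and `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://roa.rutgers.edu/article/browse"  # archive URL from this post

# Hypothetical markup: the real browse page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/view/1">First paper</a></li>
  <li><a href="/article/view/2">Second paper</a></li>
  <li><a name="top">No link here</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url=BASE_URL):
    """Return every (url, text) pair found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
for url, text in links:
    print(url, text)
```

Fetching the live page is left out on purpose; the HTML string could come from `urllib.request.urlopen` or any HTTP client once the page layout is known.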

Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/

NER

Named Entity Recognition (NER) in Python with Spacy

File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
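
To illustrate the idea behind these tools, here is a minimal sketch that guesses a file's type from its leading "magic" bytes rather than its extension. It is not how python-magic is implemented (that library wraps libmagic and knows far more formats); the signature table is a small hand-picked sample, and `sniff_type` and `sniff_file` are hypothetical helper names.

```python
# A few well-known file signatures. This list is deliberately short and
# is only an illustration, not a replacement for libmagic.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"%!PS", "application/postscript"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx and .odt
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),  # legacy .doc, .xls, .ppt
    (b"{\\rtf", "application/rtf"),
    (b"\x1f\x8b", "application/gzip"),
]

FALLBACK = "application/octet-stream"


def sniff_type(data):
    """Guess a MIME type from the first bytes of a file's content."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return FALLBACK


def sniff_file(path):
    """Read only the first few bytes of a file on disk and guess its type."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(16))


print(sniff_type(b"%PDF-1.4\n..."))
print(sniff_type(b"just some text"))
```

A check like this is mainly useful for downloads that arrive without an extension or with a misleading one.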

XML parsing

https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
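
Since the end goal is an OLAC feed, a small sketch of writing and re-reading one metadata record with the standard library's `xml.etree.ElementTree` may help. It builds only a bare `olac:olac` element with a few Dublin Core fields; the static repository wrapper around the records is omitted and should be taken from the OLAC and OAI specifications. The title, creator, and identifier values are made up, and `build_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use readable prefixes instead of ns0, ns1, ...
ET.register_namespace("olac", OLAC_NS)
ET.register_namespace("dc", DC_NS)


def build_record(title, creators, identifier):
    """Build one olac:olac element holding basic Dublin Core fields."""
    record = ET.Element(f"{{{OLAC_NS}}}olac")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for name in creators:
        ET.SubElement(record, f"{{{DC_NS}}}creator").text = name
    ET.SubElement(record, f"{{{DC_NS}}}identifier").text = identifier
    return record


# Made-up values, for illustration only.
record = build_record(
    "An example paper",
    ["Doe, Jane"],
    "http://roa.rutgers.edu/article/view/1",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Round trip: parse the serialized record and read the title back.
parsed = ET.fromstring(xml_text)
titles = [el.text for el in parsed.findall(f"{{{DC_NS}}}title")]
print(titles)
```

The round trip here parses XML that the script itself produced. For XML fetched from elsewhere, the defusedxml package linked above offers `defusedxml.ElementTree.fromstring` as a hardened drop-in for the parsing call.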



Scraping archives for OLAC