
The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)


Tag Archives: bs4

Scraping archives for OLAC

Posted on November 2, 2022 by Hugh Paterson III

This post collects resources I am compiling for scraping a language archive in order to build a static OLAC feed from it.

The archive: http://roa.rutgers.edu/article/browse

Scraping tutorials
https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html
https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/
https://www.edureka.co/blog/web-scraping-with-python/
https://www.webscrapingapi.com/python-web-scraping
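
As a rough sketch of the scraping step, the block below pulls article links out of a listing page. It uses only the standard library's html.parser as a stand-in for what BeautifulSoup (bs4) would do more conveniently. The sample markup and the "/article/view/" path marker are assumptions made up for illustration; the real archive's HTML has to be inspected first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical listing markup; the real archive's HTML will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/view/1">First paper</a></li>
  <li><a href="/article/view/2">Second paper</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs for every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def article_links(html, base_url, path_marker="/article/view/"):
    """Return only the links whose URL contains the (assumed) article path."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [(url, text) for url, text in parser.links if path_marker in url]
```

Calling article_links(SAMPLE_HTML, "http://roa.rutgers.edu/article/browse") keeps the two article links and drops the "About" link. For a live scrape, the HTML string would come from an HTTP request rather than a constant.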

Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/

NER

NLP Application: Named Entity Recognition (NER) in Python with Spacy

File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
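
To make the idea behind these links concrete, here is a minimal sketch that guesses a file's type from its leading bytes instead of its extension. The signature table is a small hand-picked assumption, not a complete database; python-magic (linked above) wraps libmagic and covers far more formats, for example via magic.from_file(path, mime=True).

```python
# Leading byte signatures for a few formats an archive is likely to hold.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),      # also .docx, .odt, .epub containers
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"{\\rtf", "application/rtf"),
]


def sniff_type(head: bytes) -> str:
    """Return a MIME type guessed from the first bytes of a file."""
    for magic_bytes, mime in SIGNATURES:
        if head.startswith(magic_bytes):
            return mime
    return "application/octet-stream"  # unknown or plain binary


def sniff_file(path: str) -> str:
    """Read just enough of the file to compare against the signatures."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

This is useful when a scraped download link has no extension: sniff_file() can decide whether the saved file should go to text extraction (PDF) or be skipped.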

XML parsing

https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
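
As a sketch of where the XML work ends up, the block below writes one OLAC-style record with Dublin Core elements and reads it back. It prefers defusedxml for parsing when it is installed and falls back to the standard library otherwise. The function names and the sample values are hypothetical, and a full static repository would also need the OAI-PMH Static Repository envelope around such records, which is omitted here.

```python
import xml.etree.ElementTree as ET

try:
    # defusedxml provides hardened drop-in parsers for untrusted input.
    from defusedxml.ElementTree import fromstring as safe_fromstring
except ImportError:
    # Fallback for this sketch only; the stdlib parser is not hardened.
    from xml.etree.ElementTree import fromstring as safe_fromstring

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_record(title, creator, identifier):
    """Serialize one <olac:olac> element holding three Dublin Core fields."""
    ET.register_namespace("olac", OLAC_NS)
    ET.register_namespace("dc", DC_NS)
    record = ET.Element(f"{{{OLAC_NS}}}olac")
    for tag, value in (("title", title),
                       ("creator", creator),
                       ("identifier", identifier)):
        ET.SubElement(record, f"{{{DC_NS}}}{tag}").text = value
    return ET.tostring(record, encoding="unicode")


def read_record(xml_text):
    """Parse a record and return {local_element_name: text}."""
    root = safe_fromstring(xml_text)
    # Tags look like "{namespace}title"; keep only the local name.
    return {child.tag.split("}", 1)[1]: child.text for child in root}
```

Round-tripping a record through build_record() and read_record() is a cheap check that the scraped metadata survives serialization before it is assembled into the feed.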

Posted in Other Journals | Tagged bs4, metadata, OLAC, Python

One should not consider the content on this website to be an official opinion of any company associated with me. These posts are solely my opinion.
