The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)

Tag Archives: bs4

Scraping archives for OLAC

Posted on November 2, 2022 by Hugh Paterson III

This post collects resources I am compiling for scraping a language archive in order to build a static OLAC feed.

https://www.youtube.com/watch?v=RvCBzhhydNk

https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html

The archive: http://roa.rutgers.edu/article/browse

https://www.geeksforgeeks.org/how-to-build-web-scraping-bot-in-python/

https://www.edureka.co/blog/web-scraping-with-python/

https://www.webscrapingapi.com/python-web-scraping
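
A minimal sketch of the link-harvesting step that the tutorials above cover. The post is tagged bs4, but this version swaps in Python's standard-library `html.parser` so it runs without third-party packages. The sample markup and the `/article/view/N` paths are hypothetical; they are not the real structure of the archive's browse page, which would need to be inspected first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return every link in `html` as an (absolute URL, text) pair."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/view/1">First paper</a></li>
  <li><a href="/article/view/2">Second paper</a></li>
</ul>
"""

BASE_URL = "http://roa.rutgers.edu/article/browse"
links = extract_links(SAMPLE_HTML, BASE_URL)
for url, text in links:
    print(url, text)
```

For a live run, the HTML string could come from `urllib.request.urlopen(BASE_URL).read().decode()`, with a pause between requests so the archive is not hammered.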

Text Extraction
https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
https://betterprogramming.pub/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

OCR
https://www.javatpoint.com/how-to-read-contents-of-pdf-using-ocr-in-python
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
https://pypi.org/project/ocrmypdf/
https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://stackabuse.com/applying-ocr-to-a-scanned-pdf-in-python-using-borb/

NER

Named Entity Recognition (NER) in Python with Spacy

File Type Detection
https://github.com/ahupp/python-magic
https://www.geeksforgeeks.org/determining-file-format-using-python/
https://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions
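
A rough sketch of the idea behind these links: identify a downloaded file by its leading "magic" bytes rather than by its extension. python-magic wraps libmagic and knows far more formats; this stand-in uses only the standard library and a short, hand-picked signature table, so treat the table and the function names as illustrative assumptions.

```python
# A few well-known file signatures. This list is deliberately tiny.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also matches ZIP-based formats such as .docx
]


def sniff_bytes(head):
    """Guess a MIME type from the first bytes of a file."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return "application/octet-stream"  # unknown: fall back to generic binary


def sniff_file(path, size=16):
    """Read the first `size` bytes of `path` and guess its MIME type."""
    with open(path, "rb") as fh:
        return sniff_bytes(fh.read(size))


print(sniff_bytes(b"%PDF-1.7\n"))
```

This matters for a scrape because archive download URLs often lack an extension, and the OLAC record should describe the format of what was actually served.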

XML parsing

https://pypi.org/project/defusedxml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
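
A small sketch of the output side: writing one scraped item as an OLAC metadata record with the standard-library `xml.etree.ElementTree`. The field values and the `build_record` helper are hypothetical, only three Dublin Core elements are shown, and the repository wrapper that a complete static feed needs is left out. defusedxml, linked above, is the safer choice when parsing XML from an untrusted source; here the XML is only generated and then read back.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use readable prefixes instead of ns0/ns1 when serializing.
ET.register_namespace("olac", OLAC_NS)
ET.register_namespace("dc", DC_NS)


def build_record(title, creator, identifier):
    """Build a minimal <olac:olac> element with three Dublin Core fields."""
    record = ET.Element(f"{{{OLAC_NS}}}olac")
    fields = (("title", title), ("creator", creator), ("identifier", identifier))
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Placeholder values, not real archive metadata.
record = build_record(
    "Example title",
    "Example Author",
    "http://example.org/item/1",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```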

Posted in Other Journals | Tagged bs4, metadata, OLAC, Python, R-90

One should not consider the content on this website to be an official opinion of any company associated with me. These posts are solely my opinion.

© 2005-2025 Hugh Paterson III All Rights Reserved.
By submitting a comment here you grant this site a perpetual license to reproduce your Words, Name & Website URL in attribution.
Details of your viewing experience may be retained and used.