DSpace and Dataverse have a bug in Parsing OLAC XML

Posted on January 22, 2024 by Hugh Paterson III

Tonight I read about a bug in XOAI, a foundational library for DSpace and Dataverse, which uses the lxml library for parsing.

Since the OLAC XML implementation of OAI-PMH requires the use of an XSI element, it seems that the bug described at https://github.com/DSpace/xoai/issues/67 and discussed at https://github.com/gdcc/xoai/issues/141 would apply.
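To make the dependency on XSI concrete, here is a minimal sketch of how a harvester might pick out the XSI-typed elements of an OLAC record. It uses Python's standard library rather than XOAI (which is a Java library), and the sample record is made up for illustration; only the namespace URIs and the xsi:type / olac:code pattern are taken from the OLAC 1.1 metadata format.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OLAC 1.1 records.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OLAC = "http://www.language-archives.org/OLAC/1.1/"

# Hypothetical record, invented for this example.
SAMPLE_RECORD = """\
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Example word list</dc:title>
  <dc:language xsi:type="olac:language" olac:code="eng"/>
</olac:olac>
"""


def xsi_typed_elements(xml_text):
    """Return (element name, xsi:type value, olac:code value) for each typed element."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        # ElementTree exposes namespaced attributes as "{namespace-uri}name".
        xsi_type = element.get(f"{{{XSI}}}type")
        if xsi_type is not None:
            local_name = element.tag.split("}", 1)[-1]
            found.append((local_name, xsi_type, element.get(f"{{{OLAC}}}code")))
    return found


print(xsi_typed_elements(SAMPLE_RECORD))
```

If a parser or serializer drops the XSI namespace declaration or the xsi:type attribute, the language code above loses the context that marks it as an OLAC language identifier, which is why a bug in this area matters for OLAC data providers.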

The SIL archive and its two sided markets

Posted on February 19, 2013 by Hugh Paterson III

I have been thinking about the language data marketplace (exchange if one prefers), and the role of archives in a world where minority language speakers are also internet users and digital file consumers. In particular I have been thinking about SIL’s Language and Culture Archive and the economic model called a two sided market. SIL, as “Partners in Language Development”, seems to be well situated for analysis using the two sided market model (matching linguists and professionals with language development skills, and persons with language development skills with parties interested in developing their language). On the surface, it seems that the SIL archive would also benefit from being the center of exchange between these same two groups. This is the subject of one of my slides for an upcoming presentation; therefore I sketched out the interactions various SIL staff might have with the archive to see if I could diagram the social interactions around language data in SIL’s two sided market. To my surprise, the two sided nature of access to data in the archive is not supported, thereby blocking a data-centric archiving service. It makes me wonder what the perceived value of the archive really is, and if the perceived value is low, then why bother? What is the return on investment (ROI) for users on either side of the market?

I tried to summarize the relationships between the various clients of the archive in the following image.

Media and relationships among different roles in SIL projects.

DSpace and the Presentation Layer

Posted on November 30, 2011 by Hugh Paterson III

Drupal

Because I have been on the team doing the SIL.org redesign, I have been surveying the open source landscape to see what is available to connect Drupal with DSpace data stores. We are planning on making DSpace the back-end repository, with another CMS running the presentation and interactive layers. I found a module, still in development, which parses DSpace's XML feeds. However, this is not the only thing that I am looking at. I am also looking at how we might deploy Omeka. Presenting the entire contents of a Digital Language and Culture Archive, and citations for its physical contents, is no small task. In addition to past content there is also future content. That is to say, archiving is also not devoid of publishing, so there is also the PKP project [sic redundant]. (SIL also currently has a publishing house, whose content needs CSV or version control and editorial workflows, which interact with archiving and presentation functions.)

Omeka

Wally Grotophorst has a really good reflection on Omeka and DSpace. I am not sure that it is current, but it does present the problem space quite well. ^[1] Tom Scheinfeldt at Omeka also has a nice write-up on why Omeka exists, titled "Omeka and It's peers". It is really important to understand Omeka's place in the ecosystem of content delivery to content consumers by qualified site administrators. ^[2]

@Mire talks about what DSpace could learn from Omeka. ^[3]

Threads on the DSpace mailing list and the Omeka forums discuss some DSpace technologies for mixing with OAI-ORE and Fedora, Omeka, and Drupal.

http://omeka.org/forums/topic/omeka-and-harvesting-from-dspace
http://omeka.org/forums/topic/import-to-dspace

References
↑1	Wally Grotophorst. 4 March 2008. DSpace And Omeka. iNODE: The weblog of Digital Programs and Systems at George Mason University Libraries. http://timesync.gmu.edu/wordpress/?p=485 . [Accessed: 26 November 2011] [Link]
↑2	Tom Scheinfeldt. 21 September 2010. Omeka and It's peers. http://omeka.org/blog/2010/09/21/omeka-and-peers/ [Accessed: 26 November 2011] [Link] [Also Posted on Tom's Blog]
↑3	@Mire. 20 May 2010. What DSpace could learn from Omeka. http://www.facebook.com/notes/mire/what-dspace-could-learn-from-omeka/393758568767 . [Accessed: 26 November 2011] [Link]

Metadata Magic

Posted on August 10, 2011 by Hugh Paterson III

The company I work for has an archive for many kinds of materials. In recent times this company has moved to start a digital repository using DSpace. To facilitate contributions to the repository, the company has built an Adobe AIR app which allows for the uploading of metadata to the metadata elements of DSpace as well as the attachment of the digital item to the proper bitstream. Totally Awesome.

However, one of the challenges is that just because the metadata is curated, collected and properly filed, it does not mean that the metadata is embedded in the digital items uploaded to the repository. PDFs are still being uploaded with the PDF’s author attribute set to Microsoft-Word. (More about the metadata attributes of PDF/A can be read on pdfa.org.) Not only are the correct metadata and the wrong metadata in the same place at the same time (and being uploaded at the same time), but later, when a consumer of the digital file downloads the file, only the wrong metadata will travel with the file. This is not just happening with PDFs but also with .mp3, .wav, .docx, .mov, .jpg and a slew of other file types. This saga of bad metadata in PDFs has been recognized since at least 2004 (James Howison & Abby Goodrum. 2004. Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards).

So, today I was looking around to see if Adobe AIR can indeed use some of the available tools to propagate the correct metadata into the files before upload, so that when the files arrive in DSpace they will have the correct metadata.

The first step is to retrieve metadata from files. It seems that Adobe AIR can do this with PDFs. (One would hope so as they are both brain children of the geeks at Adobe.) However, what is needed in this particular set up is a two way street with a check in between. We would need to overwrite what was there with the data we want there.
However, as of 2009, there were no tools in AIR which could manipulate EXIF data (for photos).
But it does look like the situation is more hopeful for working with audio metadata.
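The "two way street with a check in between" can be sketched independently of AIR. The snippet below is only an illustration in Python: it assumes the embedded and curated metadata have already been read into plain dictionaries, and the field names and values are invented for the example.

```python
def fields_to_overwrite(embedded, curated):
    """Return the curated values that differ from what is embedded in the file.

    Only fields present in the curated record are checked; embedded fields
    that the curated record does not mention are left alone.
    """
    changes = {}
    for field, wanted in curated.items():
        if embedded.get(field) != wanted:
            changes[field] = wanted
    return changes


# Hypothetical values: the file says one thing, the curated record another.
embedded = {"Author": "Microsoft-Word", "Title": "Field notes"}
curated = {"Author": "Example Author", "Title": "Field notes"}

print(fields_to_overwrite(embedded, curated))
```

The point of the check is that the packager only writes what actually needs correcting, and it can show the submitter the differences before anything in the file is overwritten.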

One way around the limitations of JavaScript itself might be to use JavaScript to call a command-line tool, to execute a Python, Perl, or shell script, or even to use a library. There are some technical challenges which need to be bridged when using these kinds of tools in a cross-platform environment (anything from flavors of Linux to OS X 10.4-10.7 and Windows XP through current). This is mostly because of the various ways of implementing scripts on different platforms.

The technical challenge is that Adobe AIR is basically a JavaScript environment. As such, there are certain technical challenges around implementing command-line tools like Xpdf from fooLabs and Coherent PDF Tools, or Phil Harvey’s ExifTool, Exifv2, pdftk, or even TagLib. One of the things that Adobe AIR can do is call an executable via something called ActionScript. There are even examples of how to do this with PDF metadata. This method uses PurePDF, a complete ActionScript PDF library. ActionScript is powerful in and of itself: it can be used to access the XMP metadata of a PDF, though one could also use it to call on Java to do the same “work”.
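Outside of AIR, the "script calls a command-line tool" idea can be sketched in a few lines. This is a rough Python illustration, not what the packager does: it assumes ExifTool is installed and on the PATH, and the helper names are my own. The only ExifTool features relied on are the -j option (JSON output) and the -TAG=VALUE syntax for writing a tag.

```python
import json
import shutil
import subprocess


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing.

    shutil.which also handles Windows conventions such as the .exe suffix.
    """
    return shutil.which(name)


def read_command(tool_path, file_path):
    """Build the argument list asking ExifTool for JSON output."""
    return [tool_path, "-j", file_path]


def write_command(tool_path, file_path, tags):
    """Build the argument list that sets each tag with ExifTool's -TAG=VALUE syntax."""
    args = [tool_path]
    for tag, value in sorted(tags.items()):
        args.append(f"-{tag}={value}")
    args.append(file_path)
    return args


def read_embedded_metadata(file_path, tool_name="exiftool"):
    """Run the tool and return the metadata of one file as a dictionary."""
    tool_path = find_tool(tool_name)
    if tool_path is None:
        raise FileNotFoundError(f"{tool_name} was not found on PATH")
    # Passing a list (no shell) sidesteps quoting differences between platforms.
    result = subprocess.run(
        read_command(tool_path, file_path),
        capture_output=True, text=True, check=True,
    )
    # With -j, ExifTool prints a JSON array containing one object per file.
    return json.loads(result.stdout)[0]


print(write_command("exiftool", "paper.pdf", {"Author": "Example Author"}))
```

Building the argument list separately from running it keeps the platform-specific part (finding the executable) small, and makes the command itself easy to inspect or log before anything is written to a file.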

Three Lingering Thoughts

Even if the Resource and Metadata Packager has the ability to embed the metadata in the files themselves, it does not mean that the submitters would know how to use it or why to use it. This is not, however, a valid reason to not include functionality in a development project. All marketing aside, an archive does have a responsibility to consumers of the digital content that the content will be functional. Part of today’s “functional” is the interoperability of metadata. Consumers do appreciate – even expect – that the metadata will be interoperable. The extra effort taken on the submitting end of the process pays dividends as consumers use the files with programs like Picasa, iPhoto, PhotoShop, iTunes, Mendeley, Papers, etc.
Another thought that comes to mind is that when one is dealing with large files (over 1 GB), there is a reason for making a “preview” version of a couple of MB. That is, if I have a 2 GB audio file, why not make a 4 MB .mp3 file for rapid assessment, to see if it is worth downloading the .wav file? It seems that a metadata packager could also create a presentation file on the fly. This is no less true with photos or images. If a command-line tool like ImageMagick could be used, that would be awesome.
This problem has been addressed in the open source library science world. In fact a nice piece of software does live out there. It is called the Metadata Extraction Tool. It is not an end-all for all of this archive’s needs but it is a solution for some needs of this type.
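The preview idea in the second thought above could look roughly like the following. This is a sketch under assumptions: it presumes ffmpeg and ImageMagick are installed, the "_preview" naming scheme is my own invention, and the duration, bitrate, and size defaults are arbitrary. Note that ImageMagick 7 installs its command as "magick", while older versions use "convert".

```python
from pathlib import Path


def audio_preview_command(source, seconds=60, bitrate="64k"):
    """Build an ffmpeg command for a short, low-bitrate MP3 excerpt of an audio file."""
    source = Path(source)
    target = source.with_name(source.stem + "_preview.mp3")
    return ["ffmpeg", "-i", str(source), "-t", str(seconds), "-b:a", bitrate, str(target)]


def image_preview_command(source, max_size=1024, tool="convert"):
    """Build an ImageMagick command for a JPEG that fits within max_size pixels.

    -resize WxH keeps the aspect ratio and fits the image inside the box.
    """
    source = Path(source)
    target = source.with_name(source.stem + "_preview.jpg")
    return [tool, str(source), "-resize", f"{max_size}x{max_size}", str(target)]


print(audio_preview_command("interview.wav"))
print(image_preview_command("scan.tif"))
```

Either command list can be handed to subprocess.run once the packager has confirmed that the tool is available on the submitter's machine.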

The Journeyler

A walk through: Life, Leadership, Linguistics, Language Documentation, WordPress, and OS X (and a bit of Marketing & Business Administration)
