Amid the flurry of commit messages on the Simile development discussion list, I happened to notice that the Simile Project includes an RDFizer project, which has a component called oai2rdf.

oai2rdf is a command-line program that uses Jeff Young’s OAIHarvester2 and some XSLT magic to harvest an entire OAI-PMH archive and convert it to RDF.

  % oai2rdf.sh http://cogprints.ecs.soton.ac.uk/perl/oai2 cogprints

This will harvest the entire cogprints eprint archive and convert it on the fly to RDF, which is saved in a directory called cogprints. In case you are wondering: yes, it handles resumption tokens. You can also give it date ranges to harvest, and tell it to harvest only particular metadata formats. By default it grabs all available metadata formats.
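For anyone curious what handling resumption tokens involves, here is a minimal Python sketch of an OAI-PMH ListRecords loop. It is not how oai2rdf or OAIHarvester2 is implemented; it only illustrates the protocol. The names `harvest` and `fetch_url` and the `from_date`/`until_date` parameters are my own, and the sketch assumes the server returns well-formed XML and ignores OAI-PMH error responses.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Default fetcher (hypothetical helper): return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", from_date=None,
            until_date=None, fetch=fetch_url):
    """Yield every <record> element, following resumption tokens to the end."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Optional date range, sent as the protocol's "from" and "until" arguments.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date

    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        yield from root.iter(OAI_NS + "record")

        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

Calling `harvest("http://cogprints.ecs.soton.ac.uk/perl/oai2")` would walk the whole archive one page at a time; the `fetch` argument is there so the loop can be exercised without touching the network.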

As part of my day job I’ve been looking at some RDF technologies like Jena, and while there are lots of chunks of RDF around on the web to play with, oai2rdf suddenly opens up the possibilities quite a bit.

Getting oai2rdf up and running is pretty easy. First get the oai2rdf code:

  svn co http://simile.mit.edu/repository/RDFizers/oai2rdf/ oai2rdf

Next make sure you have Maven. If you don’t have it, Maven is very easy to install: just download it, unpack it, and make sure the maven/bin directory is in your path. Then, from the oai2rdf directory you just checked out, you can:

  mvn package

The magic of Maven will pull down the dependencies and compile the code, after which you should be able to run oai2rdf.

Art Rhyno has been talking about the work the Simile folks are doing for quite a while now, and only recently have I started to see what a rich set of tools they are developing.