marcdb
If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.
So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.
At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.
SELECT tag, COUNT(*) AS tag_count FROM data_fields GROUP BY tag ORDER BY tag_count DESC; tag | tag_count —–+———– 035 | 558727 670 | 496600 040 | 379999 010 | 379999 953 | 369625 906 | 272196 550 | 232544 150 | 217556 450 | 211067 952 | 185012 151 | 158900 451 | 143538 781 | 122490 043 | 92656 053 | 92404 675 | 42496 551 | 24797 667 | 14434 985 | 13725 680 | 10342 681 | 8873 410 | 7103 360 | 4126 073 | 3540 180 | 3000 019 | 1832 678 | 1311 580 | 857 480 | 808 260 | 753 185 | 501 510 | 369 485 | 262 042 | 260 500 | 259 016 | 243 585 | 192 400 | 147 682 | 134 710 | 132 979 | 107 530 | 93 430 | 82 665 | 44 182 | 36 482 | 8 969 | 4 181 | 4 555 | 4 581 | 4 455 | 4 582 | 3 481 | 3 052 | 3 411 | 2 155 | 2 751 | 2 014 | 2 050 | 2 856 | 1
Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.
SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
'555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count
value | value_count
-------+-------------
g | 8438
nne | 1243
nnaa | 1083
a | 146
b | 140
nna | 8
bnna | 4
anna | 3
n | 2
nnnd | 2
nnnb | 1
(11 rows)
So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.
October 11th, 2007 at 10:49 am
Thanks, this is nice and handy.
Since the goal is to run easy SQL query, perhaps it make sense to bend SQL rule a bit to use less tables and joins? right now the schema using four tables, if it’s reduced to one table:
(recordid, tag, position, indicator1,indicator2, subfield, value)
fol05731351, leader, ….. , 00755cam 22002414a 4500
fol05731351,008, …. 000107s2000 nyua 001 0 eng
fol05731351,245,1,0,1,a, Perl (Computer program language)
….
October 15th, 2007 at 6:08 am
I like this idea. Had you thought about what the queries would look like? Flattening like this is typical in data warehousing applications where the emphasis is more on reporting. There would be a lot of duplication of data, but I don’t see that as a particular problem. But it might make queries that want to treat the record as a unit a bit problematic. Thanks for the feedback.
November 5th, 2007 at 10:14 am
[...] headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spiro’s authority data–so I fired psql and wrote a [...]