marcdb

If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 
FROM data_fields 
GROUP BY tag 
ORDER BY tag_count DESC;

tag | tag_count 
—–+———– 
035 |    558727
670 |    496600
040 |    379999
010 |    379999
953 |    369625
906 |    272196
550 |    232544
150 |    217556
450 |    211067 
952 |    185012 
151 |    158900 
451 |    143538 
781 |    122490 
043 |     92656 
053 |     92404 
675 |     42496 
551 |     24797 
667 |     14434 
985 |     13725 
680 |     10342 
681 |      8873 
410 |      7103 
360 |      4126 
073 |      3540 
180 |      3000 
019 |      1832 
678 |      1311 
580 |       857 
480 |       808 
260 |       753 
185 |       501 
510 |       369 
485 |       262 
042 |       260 
500 |       259 
016 |       243 
585 |       192 
400 |       147 
682 |       134 
710 |       132 
979 |       107 
530 |        93 
430 |        82 
665 |        44 
182 |        36 
482 |         8 
969 |         4 
181 |         4 
555 |         4 
581 |         4 
455 |         4 
582 |         3 
481 |         3 
052 |         3 
411 |         2 
155 |         2 
751 |         2 
014 |         2 
050 |         2 
856 |         1

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count

 value | value_count
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.

3 Responses to “marcdb”

  1. Xiaoming Liu Says:

    Thanks, this is nice and handy.

    Since the goal is to run easy SQL query, perhaps it make sense to bend SQL rule a bit to use less tables and joins? right now the schema using four tables, if it’s reduced to one table:

    (recordid, tag, position, indicator1,indicator2, subfield, value)

    fol05731351, leader, ….. , 00755cam 22002414a 4500
    fol05731351,008, …. 000107s2000 nyua 001 0 eng
    fol05731351,245,1,0,1,a, Perl (Computer program language)
    ….

  2. ed Says:

    I like this idea. Had you thought about what the queries would look like? Flattening like this is typical in data warehousing applications where the emphasis is more on reporting. There would be a lot of duplication of data, but I don’t see that as a particular problem. But it might make queries that want to treat the record as a unit a bit problematic. Thanks for the feedback.

  3. inkdroid » Blog Archive » more marcdb Says:

    [...] headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spiro’s authority data–so I fired psql and wrote a [...]

Leave a Reply