marcdb

If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 

FROM data_fields 

GROUP BY tag 

ORDER BY tag_count DESC;

 

tag | tag_count 

-----+-----------
 
035 |    558727

670 |    496600

040 |    379999

010 |    379999

953 |    369625

906 |    272196

550 |    232544

150 |    217556

450 |    211067
 
952 |    185012
 
151 |    158900
 
451 |    143538
 
781 |    122490
 
043 |     92656
 
053 |     92404
 
675 |     42496
 
551 |     24797
 
667 |     14434
 
985 |     13725
 
680 |     10342
 
681 |      8873
 
410 |      7103
 
360 |      4126
 
073 |      3540
 
180 |      3000
 
019 |      1832
 
678 |      1311
 
580 |       857
 
480 |       808
 
260 |       753
 
185 |       501
 
510 |       369
 
485 |       262
 
042 |       260
 
500 |       259
 
016 |       243
 
585 |       192
 
400 |       147
 
682 |       134
 
710 |       132
 
979 |       107
 
530 |        93
 
430 |        82
 
665 |        44
 
182 |        36
 
482 |         8
 
969 |         4
 
181 |         4
 
555 |         4
 
581 |         4
 
455 |         4
 
582 |         3
 
481 |         3
 
052 |         3
 
411 |         2
 
155 |         2
 
751 |         2
 
014 |         2
 
050 |         2
 
856 |         1

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count

 value | value_count 
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.