Off-and-on over the past few years I’ve been helping out with the RepoData project, which is a dataset of location information for more than 30,000 archival repositories in the United States. The initial data collection was done by Eira, Ben, and Whitney over several years by aggregating information from over 100 sources. After this heavy lift I worked a bit with Hillel to help move the data from Excel spreadsheets (where they were initially created) to CSV, JSON and GeoJSON files, where they are managed now in GitHub.

The hope has been to make it easy for people to update this data, without needing to submit pull requests on Github. There is still work to do on that front. However, we were fortunate to have a pull request contribution from Charlie Macquarie, who had collected some data for California archives in the California Revealed project. This data gave us a chance to test how we would integrate contributions into the existing dataset. As we merged this data we noticed that we did end up with some duplicates.

We realized that we needed a process to ensure that we weren’t introducing duplicate entries, when people submitted new data via GitHub. Specifically we wanted a process that:

  • identifies potential duplicates by archive name, city and state
  • allows information to be merged from one record to another, when the new data had additional information
  • allows the deletion of duplicate records
  • could identify duplicates involving a particular contributor
  • could ignore duplicates where a PO Box was involved (the dataset is known to have these)
  • keeps the data sorted properly so that diffs were possible
  • relatively simple to run by a project contributor

These are some pretty specific requirements. After conferring with Hillel and Charlie I developed a small Python Terminal User Interface utility that uses Pandas and Rich to let a user manipulate the dataset using a set of keyboard commands: next, previous, move, delete, etc.

When you run the program you are presented with text interface where the identified duplicate records are laid out side by side. The user can choose to copy values from one record to another, delete records once they are no longer needed, and save when they are done.

In general I was pretty happy with Rich for displaying the data in the terminal. It seems mostly oriented to creating TUI displays, and (unless I misunderstood) the components for gathering input were pretty rudimentary, but seemed to get the job done.

We’ll see how well the TUI approach works in practice, but I like the fact that we don’t need to keep a web application online for doing this data-curation, or giving complicated instructions for using Google Sheets, Excel or something else. Running the tool is fairly easy, assuming you’ve have a modern Python environment with Poetry installed (big assumptions I guess).

A couple years ago I created a map using the geo-coordinates in the data. It lets you zoom in on locations, and even link to Google Street View to see what the location looks like. For example:

It’s actually kind of fun trying to find the archive where you land in Street View…

I’m thinking one way of moving the data curation process forward a bit would be to allow people to search this datai on the map, and to easily click on a link in the record detail to create a ticket in Github, if data needs to be updated. A similar link for adding an entry could also be provided that populates a GitHub issue. This will require someone to keep an eye on the issue tracker for records that need to be updated, but hopefully that won’t be too difficult. Maybe a view into the issue queue could go on the website as well as a reminder?

If you have thoughts or ideas please get in touch. And if the data is useful to you please download and use it, it’s available under a Open Data Commons Open Database License.


Update: I got some good feedback from Timothy who suggested that it might be worth checking out the Textual Python framework, which has some common ancestry with Rich, and provides more features for events and interactivity, which I found kind of rudimentary in Rich.