browsertricky
Sometimes I want to archive a web page or website using browsertrix-crawler but I always have to go to the documentation to look up the specifics of the Docker incantation. Once I get something working I often want a record of how I ran the crawl so I can remember how it was created. I’ve done this enough times that I decided to create a little shell script and directory structure for the configuration files.
If you’d like to try this approach, install Git and Docker if you don’t already have them, and then:
$ git clone https://github.com/edsu/browsertricky.git
$ cd browsertricky
$ ./browsertricky example
To test your new web archive of the https://example.com website go to https://replayweb.page and load the WACZ file that was
created at collections/example/example.wacz
.
🎉 ✨ 🦄 🛸
Ok, that’s not a terribly interesting example (pun intended). But you can use the example config as a template for another configuration to archive another site. To do that you’ll need to:
$ cp config/example.yaml config/mysite.yaml
Edit the config/mysite.yaml
adding information about a
site you would like to archive:
- Change
collection
fromexample
tomysite
, which will change where your WACZ is written in thecollections
directory - Change the
seeds
list to include a new URL likehttps://mysite.com
And run it!
$ ./browsertricky mysite
If you open http://localhost:9037 while the crawl is underway you should see a screencast of the browser.
Read the browsertrix-crawler documentation for all the options you can put in your YAML configuration files. There are quite a few!
PS. I should really adapt this to work with podman too… It was updated to prefer
podman to docker if it is available.
PPS. Check out browsertrix-cloud if you don’t want to work on the command line and would rather use a web application to do it.