Some things to consider when deciding whether to start building with "AI" in libraries and archives.

I was asked to participate in a panel at work about AI. I initially declined, but once it became clear that I would be allowed to get on my soapbox and rant for 15 minutes I agreed. Below are my notes and some slides. This was not a fun post to write or present. I’m sure it rubbed some people the wrong way, and I am genuinely sorry for that.

I’ve done a little bit of work with AI, like downloading some models from Hugging Face as part of named entity recognition experiments, running Whisper on some interviews that I wanted a transcript for, testing out Google’s new file identification tool, and writing a bot to use the OpenAI API to generate some fake diary entries from some random words a friend of mine was publishing.

But as I listened to my CPU fan spinning, all this experience has really done is reinforce some concerns I have as a software developer about the AI industry, and the application of these technologies in libraries and archives.

It’s not that I don’t think these tools and methods have some use in the cultural heritage sector, but I do think we need to think carefully and critically about them. I’m sure you will be familiar with at least a few of these topics, but I thought it could be useful to bring them together, with links to learn more, and also close each one out with some tactics for addressing them.

If you take nothing else from this presentation I’d like it to be that despite what the “boomers” and “doomers” would like you to believe, the ascendency of AI is not inevitable, and we have decisions to make. What we decide to do will have a big impact on how these technologies get deployed.

Much of this perspective is informed by my own interest in Science and Technology Studies, which encourages an understanding of technology in its social and historical context, and to remember that “it could be otherwise” (Woolgar, 2014). It was also informed by reading Dan McQuillan’s book Resisting AI (highly recommended).

Despite the recent surge of interest in Large Language Models and Generative AI tools (ChatGPT, DALL-E, etc), AI is part of long history of computer automation, which is continuing to transform our work, and our lives. I have tended to prefer the term Machine Learning (ML) to AI, because it has more specificity when discussing the recent application of statistical algorithms to increasingly large datasets, using increasingly large computing environments. But I’ve also come to appreciate that the term AI is actually useful for talking about this longer trajectory of automation, stretching back to the beginnings of modern computing. Looking at this technology as part of a very long project, involving a shifting set of actors is important.

However the areas that I’m going to touch on here refer mostly to recent developments with Large Language Models, although some of them are relevant for more specialized forms machine learning as well. There are five points of concern, and for each area I’ll include a tactic for addressing it in libraries and archives.

Bias

ML models are built using data. Recent advances in Deep Learning have largely been the result of applying decades old algorithms to increasingly large amounts of data collected from the web. The data that is used to train these models is significant because the models necessarily reflect the data that was used to create them. Unfortunately corporations are increasingly tight lipped about the data that has been used to train these models (more on that next).

Some commonly used datasets like CommonCrawl represent significantly large collections of web data, but the web is a big place, and decisions have gone into what websites were collected. CommonCrawl is not representative of the web as a whole. Furthermore LLMs encode biases that are present in today’s society. Blindly using and becoming dependent on LLMs risks further encrusting these biases and participating in systemic racism.

As LLMs are used to generate more and more web content there is also a risk that this data is again collected and used to train future models. This process has been called Model Collapse and has been shown to lead to a process of forgetting. OpenAI launched a tool for identifying content generated with an LLM and had to shut it down 6 months later because it didn’t work, and its not clear that it can even be done with reliability. What would it mean to only train these models with pre-2023 data?

Tactic: When evaluating an AI tool always see if you can identify what data has been used to train the model(s). How has it been “cleaned” or shaped? How is it updated?

Intellectual Property

Since LLMs have been built with data collected from the web this includes many types of content, from openly licensed datasets designed to be shared, to copyrighted books like those found in the books1,2,3 datasets, which are rumored to have been assembled from shadow libraries like Library Genesis and SciHub. Over the last year we’ve seen several lawsuits including from the Authors Guild challenging OpenAI’s use of copyrighted materials in building their GPT models.

In some ways these types of lawsuits are not new to the web. Napster was challenged by the Recording Industry Association of American; Google Books was sued by the Authors Guild in the mid 2000s; the Internet Archive has been recently sued over its Open Library platform. But what makes LLMs a bit different is the way they transform the content they’ve collected, rather than making it available verbatim. The US Copyright Office published a notice of inquiry last year to gather information about the use of copyrighted materials in AI tools, which we can expect to hear more about this year.

But this is not just an issue for blatantly pirated material.

The New York Times is also suing because of how millions of their openly published news stories were used by OpenAI to train their models, without a license. OpenAI is in the midst of trying to negotiate licensing contracts after the fact with many big players.

The way LLMs function represents a big shift in how the web ecosystem has evolved. Web search engines like Google crawl web pages to index them, and provide users with search results that link back to the original website. Similarly, social media platforms have provided a place to discuss web content by sharing links to it, driving other users to the web publisher.

In the LLM paradigm users never leave the ChatGPT interface, and the original publisher is completely cut out of the virtuous circle. LLMs are enclosing the web commons, and threaten to choke off the very sources of content that they used. Web publishers will lose the ability to understand how their content is being used.

Some web publishers have chosen to tell LLM bots to stop using robots.txt. Not all the bots collecting data from the web for LLMs will respect robots.txt files. In one experiment Ben Welsh found that 54% of news publishers (628 out of 1156) have decided to block OpenAI, Google AI, or CommonCrawl.

Tactic: What content should we make available to Generative AI tools. What would our donors want?

Verifiability

One of the reasons why ChatGPT doesn’t link to websites as citations is that it doesn’t know what to link to. In LLMs the neural network doesn’t record information about where a particular piece of data came from. As LLMs get integrated into more traditional search tools the challenge is to make generated text verifiable in the sense that the results include in-line citations, which should support the statement that they are used in.

Verifiability is important for understanding when generated content is out of alignment with the world, a so called “hallucination”. It’s also important for explaining why the model generated the response it did, when trying to debug why some interaction went wrong. Explainability is an active research area in the ML/AI community, and it’s not clear that given the model size and the size of the training data, whether the models can be made explainable, because at a fundamental level we don’t understand why they work. Generative AI applications that include citations have been shown to be unreliable, and provide a false sense of security.

The lack of explainability in LLMs presents real problems for libraries and archives whose raison d’être is to provide users with documents, whether they are books, maps, photographs, sound recordings, films, letters, etc. We describe these documents, and preserve these documents, in order to provide access to them, so that users can derive meaning from them. If we use an LLM to generate a response to a query or prompt, and we can’t back up the response with citations to these documents, this a problem.

This lack of verifiability is starting to be a problem for Wikipedia too.

Tactic: Library and archives professionals have a role in evaluating how AI tools cite documents as evidence.

Work

Part of the value proposition behind recent AI tools like GitHub’s Copilot, ChatGPT or DALL-E is that they democratize access to some skill whether it be writing code, authoring news stories, or creating illustrations. But is it democratic to systematically undermine creative workers, by stealing their content without having asked to use it in the first place?

When you make a decision to use these tools you are potentially replacing a person’s skill with a service. Furthermore you are binding your own organization to the whims of a corporation which would like nothing better than for you to divest of your organization’s expertise and become completely dependent on their service. It’s a trap.

If the past is any guide, we can also expect that skilled creative jobs will be replaced with lower paid jobs that involve mundane cleaning of the messes that have been made by automation. Or in the words of screenwriter C. Robert Cargill (quoted in the previous link):

The immediate fear of AI isn’t that us writers will have our work replaced by artificially generated content. It’s that we will be underpaid to rewrite that trash into something we could have done better from the start. This is what the WGA is opposing and the studios want.

LLMs like ChatGPT are built using a technique called Reinforcement Learning with Human Feedback (RLHF). The important part here is the human feedback. Who is providing this feedback? Are they users of the system? What types of systematic biases does this training introduce? Are they lower paid “ghost workers”?

Tactic: When evaluating the use of AI tools involve the people whose work is impacted in the decision making and implementation.

Sustainability

Probably the most troubling aspect to the latest wave of AI technology is their environmental impact. Recent advances in LLMs were not achieved through a better understanding of how neural networks work, but by using existing algorithms with massive amounts of data and compute resources. This training can takes months of time, and needs to be repeated to keep models up to date.

Apparently the initial training of GPT-4 took $100 million. The training relies on Graphical Processor Units (GPU) which are faster than CPUs for the types of computation that LLMs demand, but require up to four times as much energy to run. Data centers require water to cool, sometimes in environments where it is scarce. This isn’t just a problem for training models, it’s a bigger problem for querying them which has been estimated to be 60-100 times more in terms of energy utilization. Another problem lurking here is the lack of data from data centers that provides transparency about what is going on.

Is this the really the right direction for us to be headed as we are trying to reduce energy costs globally to limit global warming?

The tech industry is incentivized to try to make AI infrastructures more efficient. But Jevons Paradox will likely hold: technological progress increases the efficiency with which a resource is used, but the falling cost of use induces increases in demand enough that the resource use is increased.

My links runneth over:

Tactic: Libraries and archives should be looking for ways to reduce energy consumption not increase it.

Security and Privacy

Generative AI is a dual use technology. Experts are increasingly worried that it will be used to create disinformation as well as fake interactions online. We’ve had court cases where filings made by lawyers contained citations to cases that didn’t exist. AI generated voice robo-calls have been made illegal because of how AI tools were used to impersonate Biden’s voice. Bad actors can manipulate images and video to target specific groups because the tools are more powerful and accessible. There are possible ways to mitigate this by using trusted sources of information and provable ways of sharing the provenance of media.

Since the mechanics of how LLMs generate content are not explainable they are susceptible to attacks like what Simon Willison calls prompt injection. This is where a prompt is crafted to subvert the original design of the system to generate an intended response. This has serious ramifications for the use of LLM technology as glue between other automated systems. Indeed this was recently demonstrated by researchers using OpenAI and Google APIs to execute arbitrary code, and exfiltrate personal information.

While its not great to conflate privacy with security, I’m running out of time, and it’s important to note that privacy is also a problem. As LLM APIs are deeply integrated into applications, data will flow from one context into another. For example Docusign and Dropbox recently announced that they were integrated OpenAI into their products. When enabled your data will flow to OpenAI who may or may not use it to further train their models.

Tactic: support legislation that gives users agency over their data and practices that help ensure authenticity and provenance.

References

Woolgar, S. (2014). Struggles with representation: Could it be otherwise? In Representation in scientific practice revisited. The MIT Press. https://doi.org/10.7551/mitpress/9780262525381.003.0018