cultural heritage – Mensch.Maschine.Kultur

It could all be so simple: cultural heritage institutions and other public sector bodies provide high-quality data on a large scale and, wherever possible, under a permissive licence such as CC0 or Public Domain Mark 1.0. This is in line with the idea that cultural heritage institutions are funded by taxes, so everyone should benefit from their services and products; in the case of data, this means that innovation, research and of course private use should be possible.

However, we live in times of large language models and exploitative practices, especially on the part of US-American big tech companies. Here, data are extracted from the web on a large scale and processed into proprietary large language models. These companies are not only the drivers of innovation, but also set themselves apart from research institutions, for example, by having specially prepared training data sets at their disposal as well as exceptional computing power and the best-paid positions for developers of algorithms; all these elements are expensive ingredients of a recipe for success in the face of limited competition.

One of the weaknesses of ChatGPT – and presumably of GPT-4 – is its lack of reliability. This weakness results from the inability of purely stochastic language models to distinguish between fact and fiction, but also from a lack of data. Especially with regard to “hallucinated” literature references, bibliographic data from libraries are very attractive for building large language models. Another problem is the lack of high-quality text data. According to a recently published study, high-quality text data will be exhausted before the year 2026; this is mainly due to the lack of etiquette and proper spelling on the internet. But who, if not libraries, has huge stocks of high-quality text data? Almost all the content available here has passed through a quality filter called “publishing houses”. Opinions may differ on the intellectual quality of the books; but linguistically and orthographically, everything that was printed until the end of the 20th century (i.e. before the advent of self-publishing) is of very good quality.

Finally, there is the matter of money: inflation is back, the phase of low interest rates is over, and the first Silicon Valley bank has gone bankrupt. Many companies based there will soon need fresh money, so monetisation to generate profits will follow. Products (such as ChatGPT) that were previously offered free of charge will soon give rise to new and more capable models that provide demand-driven services in exchange for payment.

Should cultural heritage institutions as public entities serve the maximisation of the profits of a few companies by providing expensive and resource-intensive (and tax-funded) data for free? The answer has to be differentiated and is therefore complicated. Of course, data should also be made available under permissive licences, as has been the case up to now. A dual strategy can certainly be used here. On the one hand, data made available via interfaces such as OAI-PMH or IIIF continue to be accessible under a CC0 licence or Public Domain Mark 1.0; technical access restrictions can prevent large-scale data extraction, e.g. by controlling IP addresses or setting download maxima. On the other hand, specific data publications can be provided that bundle individual data sets to enable research and innovation; such offerings are protected as databases for 15 years, and here licences can be used that contain an “NC” (non-commercial) mark and make such data usable for research and innovation. As an example, the Prussian Cultural Heritage Foundation uses such a licence (CC-BY-NC-SA) for the digital representation of one of its masterpieces, and the (not so easy to use) 3D scan is also freely available under this licence.
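The technical access restrictions mentioned above could, for instance, take the form of a download quota per IP address. The following is a minimal Python sketch under stated assumptions: the class name `DownloadQuota`, its parameters and the fixed-window approach are purely illustrative and are not taken from any existing library or from an actual institutional setup.

```python
class DownloadQuota:
    """Hypothetical per-IP download limit using a fixed time window."""

    def __init__(self, max_bytes, window_seconds):
        self.max_bytes = max_bytes            # assumed quota per window
        self.window_seconds = window_seconds  # assumed window length
        self._usage = {}                      # ip -> (window_start, bytes_used)

    def allow(self, ip, size, now):
        """Return True and record the download if it still fits the quota."""
        window_start, used = self._usage.get(ip, (now, 0))
        if now - window_start >= self.window_seconds:
            # The previous window has expired: start a fresh one.
            window_start, used = now, 0
        if used + size > self.max_bytes:
            # Over quota: remember the current window, refuse the download.
            self._usage[ip] = (window_start, used)
            return False
        self._usage[ip] = (window_start, used + size)
        return True


# Example: at most 100 bytes per IP address within 60 seconds.
quota = DownloadQuota(max_bytes=100, window_seconds=60)
first = quota.allow("192.0.2.1", 60, now=0)    # fits the quota
second = quota.allow("192.0.2.1", 60, now=10)  # would exceed 100 bytes
```

A real deployment would more likely sit in a reverse proxy or API gateway and would have to deal with shared IP addresses and distributed crawlers; the sketch only shows the bookkeeping behind a download maximum.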

Interestingly, the European Union anticipated the case described above in the Data Governance Act and included a relevant set of instruments. There is a chapter on the use of data provided by public sector bodies (Chapter II, Article 6), which regulates the provision of data in exchange for fees. It states that public sector bodies may differentiate the fees they charge between private users, small and medium-sized enterprises (SMEs) and start-ups on the one hand and larger corporations, which do not fall under the former definition, on the other. This makes it possible to differentiate among commercial users, although the fees have to be based on the costs of the infrastructure needed to provide the data. This is rather atypical in the European legal system, where the principle of equal treatment normally applies. Cultural heritage institutions thus have EU Commissioner for Competition Margrethe Vestager on their side, who presented the Data Governance Act in 2020 (it is applicable from 24 September 2023, by the way). Vestager is also Executive Vice President of the European Commission for a Europe Fit for the Digital Age and imposed more than 15 billion euros in antitrust fines in her first five years in office. So the political will to enforce seems to be there.

In case of doubt, this will be necessary. Licences like CC-BY-NC-SA prohibit the commercial exploitation of public data in large language models. But the creators of large language models are moving around in a minefield regarding copyright, and in the case of other models, a stock photo agency and other rights holders have already filed copyright lawsuits; one must therefore unfortunately doubt that these companies will show consideration in the future. Of course, the court decisions in the pending cases remain to be seen. Even with reverse engineering, it is not easy to prove which data sets have been incorporated into a large language model; therefore, a kind of circumstantial evidence would have to be provided. In the medium and long term, it therefore seems more sensible to focus on establishing validation processes and standards that have to be implemented prior to publishing AI models. This includes the disclosure of the training material and process, its evaluation by experts, code audits, but also a reversal of the burden of proof with regard to the licensing of the data material used. Making such procedures an obligatory part of the approval of commercial AI applications is then actually the task of the European Union.

Finally, another way is to publish cultural heritage data in a separate Data Space for Cultural Heritage; the tender for this Data Space was launched last autumn and is part of the European Union’s Data Act. To what extent this Data Space will grant full data sovereignty to cultural heritage institutions and thus the possibility to control access to data publications remains to be seen.

Since the release of the ChatGPT dialogue system in November 2022, the societal debate about artificial intelligence (AI) has gained significant momentum and has also reached cultural heritage institutions (such as libraries, archives, and museums). The main challenge is to assess how powerful such large language models (LLMs) are in general, and Generative Pre-trained Transformers (GPTs) in particular. For the cultural heritage sector, the ChatGPT chat bot prototype reveals a whole range of possible uses: producing text summaries or descriptions of artworks, generating metadata, writing computer code for simple tasks, assisting with subject indexing and keyword indexing, or helping users find resources on the websites of cultural heritage institutions.

Undoubtedly, ChatGPT’s strengths lie in the generation of text and associated tasks. As “stochastic parrots,” as these large language models were called in a much-discussed 2021 paper, they can predict on a stochastic basis what the next words of a snippet of text will look like. In this context, ChatGPT has been trained – as a text-based dialogue system – to provide an answer in any case. This property of the chat bot points directly to one of the central weaknesses of the model: In case of doubt, ChatGPT provides untrue statements in order to maintain the dialogue. Since large language models are, after all, only applications of artificial intelligence and have no knowledge of the world, they cannot per se distinguish between fact and fiction, social construction and untruth. The fact that ChatGPT “hallucinates” (as the common anthropomorphizing term goes) when in doubt and, for example, invents literature references of course damages the reliability of the system – and it points to the great strength of libraries in providing authoritative evidence.

On the other hand, a strength of such systems is that they can excellently reproduce discourses and are therefore able to classify individual texts or larger text corpora and to describe their content in an outstanding way. This shows great potential, especially for libraries: Up to now, digital assistants that support the indexing of books have at best worked with statistical methods such as tf-idf, or with deep learning. Such approaches could be complemented through the use of topic modeling. The latter method generates a stochastically modelled sequence of words that describes the content of a work or the topics it deals with. The challenge for users so far has been to interpret this collection of words and assign a coherent label to it – and this is exactly what ChatGPT does excellently, as several researchers have confirmed. Since this massively improves and facilitates the labelling of texts, this is certainly one of the most probable use cases for AI in libraries, and exactly the field on which sub-project 3 “AI-supported content analysis and subject indexing” of the project “Human.Machine.Culture” focuses. By contrast, the handling of simple programming tasks such as creating a bibliographic record in a specific format or transforming a record from MARC.xml to JSON is in need of improvement; ChatGPT does not always perform such tasks reliably, as a recent experiment showed.
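To illustrate what a statistical method such as tf-idf does when it supports indexing, here is a minimal Python sketch. It assumes one common variant of the measure (term frequency relative to document length, multiplied by the natural logarithm of the inverse document frequency); the function names `tf_idf` and `top_terms` and the tiny example corpus are hypothetical and do not reflect the tools actually used in the project.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k):
    """Return the k highest-scoring terms of one document."""
    return sorted(score, key=score.get, reverse=True)[:k]


# Hypothetical, already tokenised mini corpus.
docs = [
    ["library", "catalogue", "library"],
    ["museum", "catalogue"],
    ["archive", "records"],
]
scores = tf_idf(docs)
keywords = top_terms(scores[0], 1)
```

Terms that are frequent in one document but rare in the rest of the corpus receive the highest scores, which is why such lists can serve as keyword candidates; assigning a coherent subject label to them remains the interpretive step described above.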

ChatGPT, as one of the most powerful text-based AI applications currently available, underlines the potential benefits of such models. At the same time, however, it also highlights the risks associated with the use of such applications: So far, only U.S.-American big tech companies are able to train such powerful models, make them accessible, and subsequently develop models optimized through reinforcement learning for specific tasks – with the clear goal of monetization. In addition, generative AI systems bring with them a number of ethical issues, as they require large amounts of text that have so far been taken from the Internet and thus from a place where not all people interact politely and with proper etiquette. For example, a recent study has underlined that large language models reproduce stereotypes by associating the terms “Muslims” and “violence”. Moreover, toxic content in the language models has to be labeled as such, an operation that is carried out by underpaid workers; this again underlines the ethical dubiousness of the process of establishing such models.

Finally, it has to be underlined that these models have been trained almost exclusively on 21st century textual material available on the Internet. By contrast, sub-project 4 “Data provision and curation for AI” of the project “Human.Machine.Culture” concentrates on the provision of curated and historical data from libraries for AI applications. Ultimately, the deployment of large language models points to very fundamental questions: namely, what role the cultural heritage of all humanity should play in the future and what effect cultural heritage institutions like libraries, archives and museums may have on their establishment; and what influence the texts generated by large language models will have on our contemporary culture as such.


On the Use of Licences in Times of Large Language Models

On the Use of ChatGPT in Cultural Heritage Institutions