
On Objectivity – and the Bridge to Truth

Statistics are held in high regard. Although the saying goes that one should “not trust any statistics you did not fake yourself”, statistics are often regarded as a prime example of objectivity, resting on the foundation of large datasets. This view is taken to the extreme when it comes to machine learning: machine learning models are statistical learners. A recently published research article criticises this view: “the mythology surrounding ML presents it—and justifies its usage in said contexts over the status quo of human decision-making—as paradigmatically objective in the sense of being free from the influence of human values” (Andrews et al. 2024).

The fact that machine learning is seen as an extreme case of objectivity has its origins in the 19th century, when the foundations of our current understanding of objectivity were laid. Human (and fallible) subjectivity was contrasted with mechanical objectivity. At that time, machines were considered to be free from wilful intervention, which was seen as the most dangerous aspect of subjectivity (Daston / Galison 2007). To this day, machines – be they cameras, sensors or electronic devices, or even the data they produce – remain emblematic of the elimination of human agency and the embodiment of objectivity without subjectivity. These perceptions persist, and it becomes necessary to explain why common sense continues to attribute objectivity and impartiality to data, statistics and machine learning.

The 19th-century debate finds its revenant today in the discussion about biases. The fact that every dataset contains statistical distortions is obviously not compatible with the attribution of objectivity, which is supposed to be inherent in large datasets in particular. From a statistical point of view, large sample sizes make even tiny effects statistically significant; the effect size therefore becomes the more important measure. Moreover, “large” does not mean “all”; rather, one must be aware of the universe covered by the data. Statistical inference, i.e. conclusions drawn from data about the population as a whole, cannot easily be applied, because the datasets are not established to ensure representativeness (Kitchin 2019). A recent article states with regard to biases: “Data bias has been defined as ‘a systematic distortion in the data’ that can be measured by ‘contrasting a working data sample with reference samples drawn from different sources or contexts.’ This definition encodes an important premise: that there is an absolute truth value in data and that bias is just a ‘distortion’ from that value. This key premise broadly motivates approaches to ‘debias’ data and ML systems” (Miceli et al. 2022). What sounds like objectivity and ‘absolute truth value’ because it is based on large datasets, statistics and machine learning models is not necessarily correct: if the model is a poor representation of reality, the conclusions drawn from its results may be wrong. This is also the reason why Cathy O’Neil in 2016 described an algorithm as “an opinion formalized in code” – it does not simply offer objectivity, but works towards the purposes and goals for which it was written.
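The point about significance and effect size can be made concrete with a minimal sketch in Python; the numbers (a group difference of 0.05 with a standard deviation of 1) are arbitrary assumptions chosen only for illustration. With growing sample size, the p-value of the same tiny difference shrinks towards zero, while the effect size (Cohen’s d) remains equally small – significance alone, in other words, says little about relevance.

```python
# Illustrative sketch: significance grows with sample size, the effect size does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_value_and_effect_size(n):
    a = rng.normal(loc=0.00, scale=1.0, size=n)   # group A
    b = rng.normal(loc=0.05, scale=1.0, size=n)   # group B: tiny assumed true difference
    _, p = stats.ttest_ind(a, b)                  # two-sample t-test p-value
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled_sd         # Cohen's d as a measure of effect size
    return p, d

for n in (100, 10_000, 1_000_000):
    p, d = p_value_and_effect_size(n)
    print(f"n = {n:>9,}   p-value = {p:.4f}   Cohen's d = {d:.3f}")
```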


Relief fragment with depiction of rowers, Hatschepsut (Queen, Ancient Egypt, 18th Dynasty). Staatliche Museen zu Berlin, Egyptian Museum and Papyrus Collection. Public Domain Mark 1.0
A historical visualisation of scientists communicating with each other and harmonising their views, in the sense of the community standing above the individual?

That scientists – and the machine learning community in particular – still adhere to the concept of objectivity and to the objective nature of scientific knowledge is owed to the fact that the latter is socially constructed: it is partly derived from collective beliefs held by scientific communities (Fleck 1935/1980). Beyond the activity of the individual researcher, the embedding of research results within a broader scientific discourse shows that scientific research is a collective activity. Much of what is termed ‘science’ is based on social practices and procedures of adjudication. As the historian of science Naomi Oreskes noted in 2019, the heterogeneity of the scientific community paradoxically supports the strength of the achieved consensus: “Objectivity is likely to be maximized when […] the community is sufficiently diverse that a broad range of views can be developed, heard, and appropriately considered.” This was obviously also clear to Miceli et al. when they took a position in the debate on biases: “data never represents an absolute truth. Data, just like truth, is the product of subjective and asymmetrical social relations.” Ultimately, the processes that take place within such scientific communities lead to what is referred to as scientific truth. Data, statistics, machine learning and objectivity are embedded in social discourses, and it is these discourses that, in the last instance, form the bridge to truth.

Large Language Models and their WEIRD Consequences

In his book “The Weirdest People in the World“, evolutionary psychologist Joseph Henrich focuses on a particular species that he calls “WEIRD people”, a play on words that resolves itself once one knows that WEIRD stands for “Western, educated, industrialised, rich, democratic”. Henrich wonders how it was possible for a small section of the population, most of whom live in the Western world, to develop a range of very specific skills. He begins with the fact that over the last 500 years, the brains of these people have been changed by extensive reading and by the influence of Luther and his imperative to read the Bible independently. In order to characterise these changes and, in particular, to work out how a dynamic of acceleration and the driving of innovation as a motor of economic growth developed in Central Europe, he deals with educational institutions, urbanisation, the development of impersonal markets, supra-regional monastic orders, universities, knowledge societies, scholarly correspondence and the formation of new (Protestant) religious groupings. If we wanted to continue Henrich’s study and extend it into the 21st century, we would have to look at the influence and changes that large language models (LLMs) have on the human brain. Although they have only existed since 2016 and have only been available to a broad user base since autumn 2022 (ChatGPT), it is already possible to anticipate some – admittedly speculative – consequences of their use.

  1. We will (have to) learn how to deal with misinformation. LLMs are great fabricators, but they cannot distinguish between true and false. As highly efficient text generators, they can produce large amounts of factually incorrect content in no time at all, which feeds the internet, social media and the comment columns of news sites. This can lead to significant distortions in political discourse, for example when elections are coming up – and this will be the case in 2024 in the USA, India, probably the UK and numerous other countries around the world. It therefore comes as no surprise that even the World Economic Forum, in its Global Risks Report this year, lists misinformation and disinformation used for the purpose of deception among the greatest risks with a short-term impact. Because LLMs produce texts by predicting the most likely next word, they generate articles that may sound plausible but are often not entirely accurate in terms of content and facts. A WEIRD consequence will therefore be that the human brain will have to learn discernment skills in order to accurately identify (and reject) this synthetic content.
  2. We will (have to) sharpen our concept of authenticity. In April 2023, Berlin-based photographer Boris Eldagsen rejected the prestigious Sony World Photography Award on the grounds that his authentic-looking image of two women was AI-generated. The jury responsible for the award was unable to distinguish the image entitled “Pseudomnesia: The Electrician” from a photo taken with a conventional camera. However, our viewing habits and perceptual routines are geared towards viewing photographs as faithful representations of reality. We will undoubtedly have to learn and adapt our concept of authenticity here, as multimodal LLMs have also become extremely powerful in the area of moving images. In January 2024, a study revealed that over 100 deepfake videos of Rishi Sunak had been distributed as adverts on Facebook in recent weeks. Both examples demonstrate the manipulability of our perception, lead to irritation, disturbance and scepticism, and point to the fact that we need to relearn how to deal with AI-generated visual content.
  3. We will (have to) come to terms with the fascination of visual worlds. Generative pretrained transformers (GPTs) will soon not only be able to generate texts, but will also be able to create complete three-dimensional visual worlds. This is exactly what Mark Zuckerberg’s vision of the metaverse aims at: to create virtual worlds that are so overwhelmingly fascinating that users can no longer detach themselves from them; in other words, visual worlds that are highly addictive. The attraction of virtual realities, as they have been known in the gaming industry up to now, is thus multiplied. In order not to become completely dependent on these worlds and not to lose touch with reality, we will therefore have to adapt our cognitive abilities – certainly a WEIRD competence in Henrich’s sense.

These three examples show only the most likely consequences that the widespread use of LLMs will have on our brains. Many others are conceivable, such as the atrophy of the ability to conceptualise complex texts (also a WEIRD ability). In terms of the plasticity of our brains, the arrival of LLMs and their output is thus in line with historical upheavals such as the invention of printing and the introduction of electronic mass media, with their consequences for cognitive organisation and social coexistence. It is no exaggeration to say that the concept of representation needs to be redefined. So far, humanity has coped quite well with such epochal upheavals. We will see how the WEIRD consequences play out in practice.

Power Hungry Magic

“Any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke observed, and it is part of the magic of new technologies that their downsides are systematically concealed. This is also the case with the energy consumption of large language models (LLMs): just as the schnitzel on the consumer’s plate obscures the realities of factory farming, so it is with the marvels of artificial intelligence. Information about the computing power required to create products such as ChatGPT and about the big data used is not provided, either to avoid making data protection and copyright issues too obvious or to avoid having to quantify the energy consumption and CO2 emissions involved in training and operating these models. The reputable newspaper Die Zeit estimated in March 2023: “For the operation of ChatGPT, […] costs of 100,000 to 700,000 dollars a day are currently incurred” and noted “1,287 gigawatt hours of electricity” or “emissions of an estimated 502 tonnes of CO2” for the training of GPT-3 (Art. “Hidden energy”, in: Die Zeit No. 14, 30.03.2023, p. 52). Against this backdrop, it comes as no surprise that, according to the International Energy Agency, the electricity consumption of the big tech companies Amazon, Microsoft, Google and Meta doubled to 72 TWh between 2017 and 2021; these four companies are also the world’s largest providers of commercially available cloud computing capacity.

Recently, Sasha Luccioni, Yacine Jernite and Emma Strubell presented the first systematic study on the energy consumption and CO2 emissions of various machine learning models during the inference phase. Inference here means the operation of the models, i.e. the period of deployment after training and fine-tuning. Inference accounts for around 80 to 90 per cent of the costs of machine learning; on a cloud computing platform such as Amazon Web Services (AWS), it is around 90 per cent according to the operator. The study by Luccioni et al. emphasises the differences between various machine learning applications: the power and CO2 intensity is massively lower for text-based applications than for image-based tasks; similarly, it is massively lower for discriminative tasks than for generative ones, including generative pretrained transformers (GPTs). The differences between the various models are considerable: “For comparison, charging the average smartphone requires 0.012 kWh of energy which means that the most efficient text generation model uses as much energy as 16% of a full smartphone charge for 1,000 inferences, whereas the least efficient image generation model uses as much energy as 950 smartphone charges (11.49 kWh), or nearly 1 charge per image generation.” The larger the model, the sooner the inference phase consumes the same amount of electricity, and emits the same amount of CO2, as the training phase did.
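The comparison can be checked with simple arithmetic. In the sketch below, the 0.012 kWh per smartphone charge and the 11.49 kWh per 1,000 image generations are the figures quoted above; the energy value for the efficient text model is back-calculated from the quoted 16% and is therefore only an approximation.

```python
# Back-of-the-envelope check of the smartphone comparison quoted from Luccioni et al.
smartphone_charge_kwh = 0.012                            # quoted reference value per full charge

text_gen_kwh_per_1000 = 0.16 * smartphone_charge_kwh     # ~0.0019 kWh, inferred from the quoted 16%
image_gen_kwh_per_1000 = 11.49                           # quoted value for the least efficient image model

print(f"text generation:  {text_gen_kwh_per_1000:.4f} kWh per 1,000 inferences "
      f"= {text_gen_kwh_per_1000 / smartphone_charge_kwh:.0%} of one charge")
print(f"image generation: {image_gen_kwh_per_1000} kWh per 1,000 inferences "
      f"= {image_gen_kwh_per_1000 / smartphone_charge_kwh:.0f} charges, "
      f"i.e. roughly one charge per generated image")
```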

Since ‘general purpose’ applications consume more energy for the same task than models that have been trained for a specific purpose, Luccioni et al. point out several trade-offs. Firstly, the trade-off between model size and power consumption: the benefits of multi-purpose models must be weighed against their power costs and CO2 emissions. Secondly, the trade-off between accuracy or efficiency and electricity consumption: task-specific models achieve higher accuracy at lower power consumption, whereas multi-purpose models can fulfil many different tasks but do so with lower accuracy and higher electricity consumption. According to the authors, these empirically proven findings call into question, for example, whether it is really necessary to operate multi-purpose models such as Bard and Bing: they “do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.”

The power hunger of large general-purpose models does not bring the “limits to growth” to the attention of the leading entrepreneurs and investors of Western big tech companies, as the famous Club of Rome report did more than 50 years ago. On the contrary, CEOs such as Jeff Bezos, whose empire also includes the largest cloud computing platform, AWS, fear stagnation: “We will have to stop growing, which I think is a very bad future.” Visions such as the metaverse are extremely costly in terms of resource consumption and emissions, and it is fair to ask whether AI applications will really be available to all of humanity in the future or only to those companies or individuals who can afford them. None of this is even remotely sustainable. Given the growing power consumption of Western big tech companies and the fact that the core infrastructure for the development of AI products is already centralised in the hands of those few players, it remains unclear where the development of ‘magical’ AI applications will lead. Scientist Kate Crawford has given her own answer to this in her book “Atlas of AI“: into space, because that is where the resources are that these corporations need.

Feeding the Cuckoo

Large Language Models (LLMs) combine words that frequently appear in similar contexts in the training dataset; on this basis, they predict the most probable word or sentence. The larger the training dataset, the more possible combinations there are, and the more ‘creative’ the model appears. The sheer size of models such as GPT-4 already provides a competitive advantage that is hard to match: There are only a handful of companies in the world that can combine exorbitant computing power, availability of big data and an enormous market reach to create such a product. No research institutions are involved in the current competition, but the big tech companies Microsoft, Meta and Google are. However, few players and few models also mean a “race to the bottom” in terms of security and ethics, as the use of big data with regard to LLMs most often also means that the training data contains sensitive and confidential information as well as copyrighted material. In numerous court cases, the tech giants have been accused of collecting the data of millions of users online without their consent and violating copyright law in order to train AI models.

Internet users have therefore already helped to feed the cuckoo in the nest. Google acknowledged this fact indirectly by updating its privacy policy in June 2023: “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Less well known, however, is the fact that the big tech companies also train their models, such as Bard, with what users entrust to them. In other words, everything you tell a chatbot can in turn be used as training material. In Google’s own words, it sounds like this: “Google uses this data to provide, improve, and develop Google products, services, and machine-learning technologies.” One consequence of the design of LLMs, however, is that the output of generative models cannot be controlled; there are simply too many possibilities with large models. If an LLM was or is trained on private or confidential data, this can lead to these data being disclosed and confidential information being revealed. The training data should therefore comply with data protection regulations from the outset, which is why there are repeated calls for transparency with regard to training data.

Consequently, in its Bard Privacy Help Hub, Google warns users of the model not to feed it with sensitive data: “Please don’t enter confidential information in your Bard conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.” This is interesting insofar as the AI hype is fuelled by terms such as ‘disruption’, while it remains unclear what the business model with which big tech companies intend to generate profits in the medium term will look like – and what exactly the use case for average users is. One use case, however, is the generation of texts that are needed on a daily basis, namely well-formulated application letters. Yet if you upload your own CV for this purpose, you are just feeding the cuckoo again. And that is not in our interest: after all, privacy is (also) a commons.

Human-Machine-Cognition

Humans search for themselves in non-human creatures and inanimate artefacts. Apes, our “next of kin”, and dogs, our “most faithful companions”, are good examples of the former; robots are good examples of the latter: a human-like design of robots’ bodies and a humanising linguistic framing of their capabilities support, according to a common hypothesis, the anthropomorphisation of these machines and, as a consequence, the development of empathetic behaviour towards robots. The tendency to anthropomorphise varies from person to person; there are “stable individual differences in the tendency to attribute human-like attributes to nonhuman agents“.

Large Language Models (LLMs) are not (yet) associated with human-like body shapes. However, this does not mean that they are not subject to the human tendency to anthropomorphise. Even a well-formulated sentence can lead us to wrongly assume that it was spoken by a rational agent. Large language models are now excellently capable of reproducing human language. They have been trained on linguistic rules and patterns and have an excellent command of them. However, knowledge of the statistical regularities of language does not enable “understanding”. The ability to use language appropriately in a social context is also still incompletely developed in LLMs. They lack the necessary world knowledge, sensory access to the world and commonsense reasoning. The fact that we nevertheless tend to understand the text produced by generative pretrained language models (GPTs) as human utterances is on the one hand due to the fact that these language models have been trained on very large volumes of 21st century text and can therefore perfectly replicate our contemporary discourse. If the way in which meaning is produced through language corresponds to our everyday habits, then it can come as no surprise that we attribute “intelligence”, “intentionality” or even “identity” to the producer of a well-crafted text. In this respect, LLMs confirm the structuralist theories of the second half of the 20th century that language is a system that defines and limits the framework of what can be articulated and thus ultimately thought. And in this respect, LLMs also seem to confirm Roland Barthes’ thesis of the “death of the author”. The infinite recombination of the available word material and the prediction of the most probable words and sentences seem to be enough for us to recognise ourselves in the text output.

On the other hand, the specific design of chatbots supports anthropomorphisation. ChatGPT, for example, has been trained on tens of thousands of question-answer pairs. Instruction fine-tuning ensures that the model generates text sequences in a specific format. The LLM interprets the prompt as an instruction, distinguishes the input of the interlocutor or questioner from the text it produces itself and draws conclusions about the human participants. On the one hand, this means that the language model is capable of adapting the generated text to the human counterpart and of imitating sociolects; on the other hand, it creates in humans the cognitive illusion of a dialogue. The interface of apps such as ChatGPT further supports this illusion; it is designed like all the other interfaces used for human conversations. We humans then follow our habits and, in the dialogue with the chatbot, add the social context that is characteristic of a conversation and assume intentionality on the other side. Finally, ChatGPT was trained as a fictional character that provides answers in the first person. The language model therefore produces statements about itself, for example about its ethical and moral behaviour, its performance, privacy and the training data used. If a user asks for inappropriate output, the language model politely declines. These statements can therefore best be understood as an echo of the training process, as what OpenAI would like us to believe about this technology. The dialogue form and the fictional character reporting in the first person are the only means by which OpenAI can control the output of the language model.
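Technically, the ‘dialogue’ is nothing more than a single text sequence that the model continues. The following minimal sketch illustrates the idea; the role markers and the template are invented for this example (the actual formats used in instruction fine-tuning by OpenAI and others are not disclosed).

```python
# Illustrative sketch of a chat-style prompt template; the role markers are assumptions.
def build_prompt(history, user_message):
    """Serialise a dialogue into one text sequence for the model to continue."""
    lines = ["<|system|> You are a helpful assistant and answer in the first person."]
    for role, text in history:
        lines.append(f"<|{role}|> {text}")
    lines.append(f"<|user|> {user_message}")
    lines.append("<|assistant|>")   # the "answer" is simply the model's continuation of this string
    return "\n".join(lines)

print(build_prompt([("user", "Hello"), ("assistant", "Hello! How can I help you?")],
                   "Who are you?"))
```

Seen this way, the first-person ‘self’ of the chatbot is a formatting convention, not an interlocutor.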

All of this can be summarised as “anthropomorphism by design”. It is therefore no wonder that we humans tend to ascribe human characteristics to a disembodied language model. However, while we are learning how to use such chatbots, we must not succumb to the illusion that we are dealing with a human interlocutor. Empathetic statements or emotions uttered by the bot are simulations that can become extremely problematic if, for example, we confuse the bot with a therapist. The assumption that a language model could be suitable for making decisions and could therefore take on the role of lawyers, doctors or teachers is also misleading: in the end, it is still humans who take responsibility for such decisions. We must therefore not be tricked by an anthropomorphising design. The impression that we have anything other than a machine as our counterpart is deceptive: there is no one there.

On the Tyranny of the Majority

Large Language Models (LLMs) predict the statistically most probable word when they generate texts. On the one hand, the fact that the predicted word or sentence is the most probable says nothing about whether it is true or false. On the other hand, the prediction of probabilities favours the majority opinion. If one word combination appears significantly more frequently than another in the training dataset, it is favoured by the LLM; likewise, if annotators assign one label more frequently than another, the more frequently assigned label is favoured and the minority opinion is suppressed. This “tyranny of the majority” has consequences for at least two important areas of society: science and culture.
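The mechanism can be illustrated with a few lines of Python and made-up counts: whether in next-word prediction or in the aggregation of annotators’ labels, the most frequent option wins and the minority position disappears from the output.

```python
# Illustrative sketch (made-up counts) of how the majority wins twice.
from collections import Counter

# 1) Next-word choice: the continuation most frequent in the training data is predicted.
continuation_counts = Counter({"orthodox": 8000, "heretical": 1500, "avant-garde": 500})
print(continuation_counts.most_common(1)[0][0])   # -> 'orthodox'

# 2) Label aggregation: the judgement of two dissenting annotators vanishes
#    from the final "gold" label.
votes = ["acceptable", "acceptable", "acceptable", "problematic", "problematic"]
print(Counter(votes).most_common(1)[0][0])        # -> 'acceptable'
```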

If we consider how Thomas Kuhn conceptualises the “structure of scientific revolutions” and Pierre Bourdieu the renewal of cultural fields, it becomes evident that every new scientific paradigm and every artistic avant-garde movement represents a minority opinion, at least initially. There is a dominant majority opinion, which Kuhn describes as paradigmatic “normal science” and Bourdieu as the “orthodox” conception of art. These social groups form the dominant pole in their respective fields and are challenged in a competition by a “revolutionary” (Kuhn) or “heretical” (Bourdieu) position. The representatives of the dominant opinion often react negatively to this opposition: “Normal science, for example, often suppresses fundamental novelties because they are necessarily subversive of its basic commitments.” (Kuhn, Structure, p.5). What follows is, sociologically speaking, a struggle for recognition, a fight for the rejection of an older scientific or artistic paradigm and the introduction of a new one.

The trial of strength between the different groups of scientists or artists can lead to different results. The new paradigm may completely replace the old one and itself assume a dominant position in the field. This happened, for example, in the study of syphilis when the pathogen was identified for the first time at the beginning of the 20th century. Another possibility is that two different scientific paradigms (or schools of art) coexist, such as Newtonian and Einsteinian mechanics; the decisive factor here is that each has its own frame of reference and the two are mutually exclusive (just as scientists often first have to develop a new ‘school of seeing’ and collect new data). Yet another possibility is that two different paradigms exist side by side without the balance between majority and minority changing. This is the case, for example, with the different interpretations of quantum mechanics: the stochastic or Copenhagen interpretation forms the majority opinion, while the deterministic or Bohmian theory represents a minority opinion. In the field of art, one can think of the overcoming of tonality and the development of the twelve-tone technique by avant-gardists such as Arnold Schönberg and Alban Berg. Although this technique was later taken up, it did not develop into the dominant method and never really became suitable for the masses (while tonality is still decisive for the majority of consumers today). Max Planck once commented ironically on the longevity of outdated scientific paradigms and their representatives with the words: “Science progresses one funeral at a time.”

The way in which Kuhn and Bourdieu conceptualise the processes of renewal in the fields of science and culture focuses primarily on the social processes associated with scientific or artistic revolutions. With regard to LLMs and the hopes associated with artificial general intelligence (AGI), this is instructive: due to its design, such ‘intelligence’ tends to reproduce the majority opinion and thus the dominant paradigm (in the field of science), or it tends towards the commonplace, the cliché, the banal and the inauthentic (in the field of art). This does not mean that intelligent machines cannot be used to create new paradigms. But they will not do so ‘by themselves’. Rather, it is clear that we have to view the seemingly overpowering AIs in the wider context of a socio-technical system in which humans still play a central role as agents – even if they are in the minority.

Human-Machine-Creativity

Language models that generate texts on the basis of probabilities are best approached with solid scepticism with regard to factual accuracy, and with a little humour. Jack Krawczyk, who is responsible for the development of the chatbot “Bard” at Google, openly admitted in March 2023: “Bard and ChatGPT are large language models, not knowledge models. They are great at generating human-sounding text, they are not good at ensuring their text is fact-based.” Calling a language model “Bard” with a wry wink hits the nail on the head: bards write poetry, tell stories and do not necessarily stick to the truth, as we have known since Plato.

Creating texts, especially literary texts, was previously the preserve of humans. Large Language Models (LLMs), however, are surprisingly good at identifying and replicating literary styles and genres. So how can we imagine literary text production from now on? Terms such as “consciousness”, “memory”, “intentionality” and “creativity” are surprisingly poorly defined, both for humans and for machines. With regard to the latter, the British cognitive scientist Margaret A. Boden has already dealt with the differences between human and machine creativity in her book “The Creative Mind” – emphasising that machines only appear to be creative to a certain degree. She distinguishes between three forms of creativity: a) making unfamiliar combinations of familiar ideas; b) exploratory creativity; and c) transformative creativity.

Producing unknown combinations from known ideas is certainly what LLMs are good at, because that is how they are built: they calculate the most likely recombinations from the available data, following the patterns present in that data. It should therefore no longer be a great challenge for an LLM to fabricate a short story in 99 different styles and thus replicate Raymond Queneau’s famous “Exercices de style“. Literary variations such as permutations, rhyme forms, jargons, narrative perspectives, sociolects, etc. should be producible with a single prompt. The phrase “a single prompt” reveals the vagueness of the term “intention”: a human must enter the prompt and thus act “intentionally”; the machine takes care of everything else.

The second form of creativity, according to Boden, explores conceptual spaces, which we can imagine as established genres in the field of literature. Genres follow rules that outline the space in which the literary action takes place; sociologist Pierre Bourdieu described them as the “rules of art”. Not everything is possible in every genre: while in crime fiction the dead do not come back to life or move around as living corpses, this is certainly possible in fantasy or horror literature. LLMs are able to identify such spaces of possibilities and replicate the patterns that characterise them. Especially when the underlying data contains many examples of literary genres such as historical novels, fantasy and romance novels along with their characteristic styles and topoi, LLMs can reliably produce recombinations and thus explore the conceptual space. Since these spaces offer many possibilities, not all of which are equally attractive to human readers, we can think of these combinatorial explorations as human-machine collaborations: A human develops a sketch of a novel and lets the outlined plot be formulated chapter by chapter by the machine. Such collaborations can be criticised from an economic rather than from an aesthetic perspective: In order to know the current space of possibilities, LLMs must also have access to material that is under copyright. When it comes to systems like ChatGPT, the data basis of which is not disclosed, this amounts to a privatisation of culture that was once public. And, to use an old argument: Here, human labour is being replaced by a machinery that enables the relevant companies to skim off the generated surplus.

The third form of creativity described by Margaret Boden aims to transform the conceptual space. Here, the rules that describe this space are broken and new ones are established. We can think, for example, of Marcel Duchamp’s urinal entitled “Fountain“, Picasso’s first cubist painting “Les Demoiselles d’Avignon” or Italo Calvino’s “Le città invisibili“. However, in order to redesign the conceptual space, you first have to know it and be able to name the rules that characterise it – only then can such a transformative work be realised in collaboration with a machine. An LLM cannot achieve this, as such models do not reflect on their own activity, lack generalisable world knowledge, and their heuristics are geared towards identifying patterns, not towards creating new ones. This is where human and machine creativity part ways: human creativity has knowledge of the world and a (possibly intuitive) knowledge of the rules of a conceptual space; in a movement of departure from known concepts, new solutions are found, radical ideas are developed and new rules are established. Transformative creativity enables humans to create new works in collaboration with a machine; the intention to leave the known space of possibilities, however, seems to be (still) reserved for humans.

It’s the statistics, stupid

“It’s the statistics, stupid”, one could say when it comes to dealing with generative pretrained transformers (GPTs). Yet, we all still have to learn this, only one year after the presentation of ChatGPT. Statistical correlations are key to understanding how stochastic prediction models work and what they are capable of.

Put in simple terms, machine learning consists of showing a machine data on the basis of which it learns or memorises what belongs to what. This data is called the training dataset. Once the machine has learnt the correlations, a test dataset is presented to the model, i.e. data that it has not yet seen. The result can be used to measure how well the machine has learnt the correlations. The basic principle is that probability models are trained on as much representative initial data (i.e. examples) as possible in order to be applicable to further, unseen data. The quality of such a model therefore always depends on how voluminous, varied and of what quality the initial data used for training is.
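A minimal sketch of this train/test principle, using scikit-learn and a made-up toy dataset (the texts, labels and model choice are assumptions purely for illustration):

```python
# Illustrative sketch: train on one part of the data, measure on unseen data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts  = ["great film", "terrible film", "loved it", "hated it",
          "wonderful acting", "awful plot", "really enjoyable", "truly boring"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative (toy example)

# The model only "sees" the training portion while learning the correlations.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Accuracy on the held-out test data measures how well the correlations were learnt.
print("test accuracy:", model.score(X_test, y_test))
```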

Large language models (LLMs) are trained to write texts on specific topics, to provide answers to questions and to create the illusion of a dialogue. The machine is shown a large number of texts in which individual words are “masked” or hidden, for example: “It’s the [mask], stupid”. In response to the question: “What is this election about?”, the model then makes a prediction as to which word—based on the training data—would most likely be in the place of [mask], in this case “economy”. In principle, “deficit”, “money” or “statistics” could just as well be used here, but “economy” is by far the most common term in the training data and therefore the most likely word. The language model combines words that often appear in similar contexts in the training data set. The same applies to whole sentences or even longer texts.
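The masking principle can be tried out with a few lines of Python. The sketch below uses the Hugging Face transformers library and the small bert-base-uncased checkpoint as an illustrative stand-in (an assumption; it is a masked language model, not one of the large generative models discussed here) and prints the highest-probability fillers together with the probabilities the model assigns to them.

```python
# Illustrative sketch of masked-word prediction with a small masked language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("It's the [MASK], stupid."):
    # Each candidate carries the probability the model assigns to it; the list is
    # ranked, so the pattern most strongly supported by the training data comes first.
    print(f'{candidate["token_str"]:>12}   p = {candidate["score"]:.3f}')
```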

However, the fact that LLMs predict probabilities has serious consequences. For example, the fact that a sentence predicted by a model is probable says nothing about whether this sentence is true or false. The generated texts may also contain misinformation such as outdated or false statements or fictions. Language models such as ChatGPT do not learn patterns that can be used to evaluate the truth of a statement. It is therefore the task of the people using the chatbot to check the credibility or truthfulness of the statement and to contextualise it. We should all learn how to do this, just as we learnt “back then” to check the reliability of a source presented as the result of a Google search. For some areas of life, the distinction between true and false is central, for example in science. A generative model that is able to produce scientific texts but cannot distinguish between true and false is therefore bound to fail—as was the case with the “Galactica” model presented by Meta, which was trained on 48 million scientific articles. Consequently, such a model will also raise questions about good scientific practice. Since science is essentially a system of references, the fact that generative models such as ChatGPT make up references when in doubt (i.e. generate a probable sequence of words) is a real problem. It can therefore come as no surprise that the word ‘hallucinate’ was named Word of the Year 2023 by the Cambridge Dictionary.

Furthermore, the truthfulness of facts depends on the context. This may sound strange at first. But even the banal question “What is the capital of the Federal Republic of Germany?” shows that the answer can vary. Just over 30 years ago, “Bonn on the Rhine” would have been the correct answer. And the answer to the question “What is this election about?” would probably be different today than it was 30 years ago (spoiler suggestion: oligarchy vs. democracy). With regard to science, it becomes even more complex: the progress of scientific knowledge means that statements that were considered true and factual just a few decades ago are now considered outdated. Programming code, too, requires humans to check what a generative model produces. This is the reason why one of the most important platforms for software developers, Stack Overflow, still does not allow answers generated by such models, as there is a realistic risk that they provide false or misleading information or malicious code. Large language models cannot verify the truth of a statement because, unlike humans, they do not have world knowledge and therefore cannot compare their output with the relevant context.

Beyond science and software development, a serious risk of language models in general is the creation of misinformation. If such models are used to generate (factually incorrect) content that is disseminated via social media or fills the comment fields of news sites, this can have serious consequences—they can increase polarisation and mistrust within a society or undermine shared basic convictions. This can have significant political consequences: In 2024, for example, new governments will be elected in the USA and India, and we can assume that these election campaigns will be largely decided by the content provided on social media. Is it the stupid statistics?

On the Use of Licences in Times of Large Language Models

It could all be so simple: cultural heritage institutions and other public sector bodies provide high-quality data on a large scale and, wherever possible, under a permissive licence such as CC0 or Public Domain Mark 1.0. This is in line with the idea that, since cultural heritage institutions are funded by taxes, everyone should also benefit from their services and products; in the case of data, innovation, research and of course private use should be possible.

However, we live in times of large language models and exploitative practices, especially on the part of US-American big tech companies. Data are extracted from the web on a large scale and processed into proprietary large language models. These companies are not only the drivers of innovation; they also set themselves apart from research institutions, for example, by having specially curated training datasets at their disposal as well as exceptional computing power and the best-paid positions for developers of algorithms – all expensive ingredients of a recipe for success in the face of limited competition.

One of the weaknesses of ChatGPT – and presumably of GPT-4 – is its lack of reliability. This weakness results from the inability of purely stochastic language models to distinguish between fact and fiction, but also from a lack of data. Especially with regard to “hallucinated” literature references, bibliographic data from libraries are very attractive for building large language models. Another problem is the lack of high-quality text data. According to a recently published study, high-quality text data will be exhausted before the year 2026; this is mainly due to the lack of etiquette and proper spelling on the internet. But who, if not libraries, holds huge stocks of high-quality text data? Almost all the content available there has passed through a quality filter called “publishing houses”. Opinions may be divided about the intellectual quality of the books; but linguistically and orthographically, everything that was printed until the end of the 20th century (i.e. before the advent of self-publishing) is of very good quality.

Finally, the money: inflation is back, the low-interest phase is over, and with Silicon Valley Bank a first bank has gone bankrupt. Many companies based there will soon need fresh money, and monetisation to generate profits will follow. New and more capable models will soon be created from products (such as ChatGPT) that were previously offered free of charge, providing demand-driven services in exchange for payment.

Should cultural heritage institutions, as public entities, serve to maximise the profits of a few companies by providing expensive and resource-intensive (and tax-funded) data for free? The answer has to be differentiated and therefore complicated. Of course, data should continue to be made available under permissive licences, as has been the case up to now. A dual strategy can certainly be used here. On the one hand, data made available via interfaces such as OAI-PMH or IIIF continue to be accessible under the CC0 licence or Public Domain Mark 1.0; technical access restrictions can prevent large-scale data extraction, e.g. by controlling IP addresses or download maxima. On the other hand, specific data publications can be provided that bundle individual datasets to enable research and innovation; such offerings are protected as databases for 15 years, and here licences can be used that contain an “NC” (non-commercial) clause and make such data usable for research and innovation. As an example, the Prussian Cultural Heritage Foundation uses such a licence (CC-BY-NC-SA) for the digital representation of one of its masterpieces, and the (not so easy to use) 3D scan is also freely available under this licence (download here).
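What such a ‘download maximum’ could look like at the application level is sketched below; the daily limit and the in-memory counter are illustrative assumptions, and a real installation would more likely enforce such limits in the web server, API gateway or repository software.

```python
# Illustrative sketch of a per-IP daily download cap for a data interface.
from collections import defaultdict
from datetime import date

DAILY_MAX = 500                      # assumed maximum number of downloads per IP and day
_downloads = defaultdict(int)        # (ip, day) -> number of downloads so far

def allow_download(ip: str) -> bool:
    key = (ip, date.today())
    if _downloads[key] >= DAILY_MAX:
        return False                 # over the cap: refuse further bulk extraction today
    _downloads[key] += 1
    return True

print(allow_download("203.0.113.7"))   # True until the daily maximum is reached
```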

Interestingly, the European Union anticipated the case described above in the Data Governance Act and included a relevant set of instruments. There is a chapter on the use of data provided by public sector bodies (Chapter II, Article 6), which regulates the provision of data in exchange for fees. It states that public sector bodies may differentiate the fees they charge between private users, small and medium-sized enterprises (SMEs) and start-ups on the one hand and larger corporations that do not fall under the former definition on the other. In this way, a possibility for differentiation among commercial users is created, whereby the fees have to be based on the costs of the infrastructure for providing the data. This is rather atypical in the European legal system, since the principle of equal treatment applies. Cultural heritage institutions thus have EU Commissioner for Competition Margrethe Vestager on their side, who presented the Data Governance Act in 2020 (and which, by the way, has applied since 24 September 2023). Vestager is also Executive Vice-President of the European Commission for a Europe Fit for the Digital Age and imposed more than 15 billion euros in antitrust fines in her first five years in office. So the political will to enforce this seems to be there.

In case of doubt, this will be necessary. Licences like CC-BY-NC-SA effectively prevent the use of public data for commercial exploitation in large language models. But since the creators of large language models are already moving through a minefield with regard to copyright – in the case of other models, a stock photo agency and other rights holders have already filed copyright lawsuits – one must unfortunately doubt that they will show consideration in the future. Of course, the relevant court decisions in the pending cases remain to be seen. Even with reverse engineering, it is not easy to prove which datasets have been incorporated into a large language model; a kind of circumstantial evidence would therefore have to be provided. In the medium and long term, it seems more sensible to focus on establishing validation processes and standards that have to be implemented prior to publishing AI models. This includes the disclosure of the training material and the training process, its evaluation by experts, code audits, but also a reversal of the burden of proof with regard to the licensing of the data material used. Making such procedures an obligatory part of the approval of commercial AI applications is then actually the task of the European Union.

Finally, another way is to publish cultural heritage data in a separate Data Space for Cultural Heritage; the tender for this Data Space was launched last autumn and is part of the European Union’s Data Act. To what extent this Data Space will grant full data sovereignty to cultural heritage institutions and thus the possibility to control access to data publications remains to be seen.