On Objectivity – and the Bridge to Truth

Statistics are held in high regard. Although the saying goes, “do not trust any statistics you did not fake yourself”, statistics are often regarded as a prime example of objectivity, grounded in large datasets. This view is taken to the extreme when it comes to machine learning, since machine learning models are statistical learners. A recently published research article criticizes this view: “the mythology surrounding ML presents it—and justifies its usage in said contexts over the status quo of human decision-making—as paradigmatically objective in the sense of being free from the influence of human values” (Andrews et al. 2024).

The fact that machine learning is seen as an extreme case of objectivity has its origins in the 19th century, when the foundations of our current understanding of objectivity were laid. Human (and fallible) subjectivity was contrasted with mechanical objectivity. At that time, machines were considered to be free from willful intervention, which was seen as the most dangerous aspect of subjectivity (Daston / Galison 2007). To this day, machines – be they cameras, sensors or electronic devices, or even the data they produce – have become emblematic of the elimination of human agency and the embodiment of objectivity without subjectivity. These perceptions persist, and it becomes necessary to explain why common sense continues to attribute objectivity and impartiality to data, statistics and machine learning.

The 19th-century debate returns today in the discussion about bias. The fact that every dataset contains statistical distortions is obviously not compatible with the attribution of objectivity that is supposed to be inherent in large datasets in particular. From a statistical point of view, large sample sizes make even tiny differences statistically significant, which is why the effect size becomes the more important measure (a minimal sketch of this follows below). Moreover, “large” does not mean “all”; rather, one must be aware of the universe covered by the data. Statistical inference, i.e. conclusions drawn from data about the population as a whole, cannot be easily applied because the datasets are not established to ensure representativeness (Kitchin 2019). A recent article states with regard to biases: “Data bias has been defined as ‘a systematic distortion in the data’ that can be measured by ‘contrasting a working data sample with reference samples drawn from different sources or contexts.’ This definition encodes an important premise: that there is an absolute truth value in data and that bias is just a ‘distortion’ from that value. This key premise broadly motivates approaches to ‘debias’ data and ML systems” (Miceli et al. 2022). What sounds like objectivity and ‘absolute truth value’ because it is based on large datasets, statistics and machine learning models is not necessarily correct: if the model is a poor representation of reality, the conclusions drawn from its results may be wrong. This is also the reason why Cathy O’Neil described an algorithm in 2016 as “an opinion formalized in code” – it does not simply offer objectivity, but works towards the purposes and goals for which it was written.
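The point about sample size can be made concrete with a small simulation. The following sketch (using numpy and scipy; the numbers are purely illustrative and not taken from any of the cited studies) compares a tiny, practically irrelevant difference between two groups at different sample sizes: the p-value collapses as the sample grows, while the effect size stays negligible.

```python
# Minimal sketch: large samples make a negligible difference "significant",
# while the effect size (Cohen's d) stays tiny. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_shift = 0.02  # a practically irrelevant difference between two groups

for n in (100, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_shift, 1.0, n)
    result = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohens_d = (b.mean() - a.mean()) / pooled_sd
    print(f"n={n:>9,}  p-value={result.pvalue:.3g}  effect size d={cohens_d:.3f}")

# With n=100 the difference is statistically invisible; with n=1,000,000 it is
# "highly significant" -- yet the effect remains practically irrelevant.
```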

Relief fragment with a depiction of rowers, Hatshepsut (Queen, Ancient Egypt, 18th Dynasty). Staatliche Museen zu Berlin, Egyptian Museum and Papyrus Collection. Public Domain Mark 1.0
A historical visualisation of scientists communicating with each other and harmonising their views, in the sense of the community standing above the individual?

The fact that scientists – and the machine learning community in particular – still adhere to the concept of objectivity and the objective nature of scientific knowledge is due to the fact that such knowledge is socially constructed: it is partly derived from collective beliefs held by scientific communities (Fleck 1935/1980). Beyond the activity of the individual researcher, the embedding of research results within a broader scientific discourse shows that scientific research is a collective activity. Much of what is termed ‘science’ is based on social practices and procedures of adjudication. As the historian of science Naomi Oreskes noted in 2019, the heterogeneity of the scientific community paradoxically supports the strength of the achieved consensus: “Objectivity is likely to be maximized when […] the community is sufficiently diverse that a broad range of views can be developed, heard, and appropriately considered.” This was obviously also clear to Miceli et al. when they took a position in the debate on biases: “data never represents an absolute truth. Data, just like truth, is the product of subjective and asymmetrical social relations.” Ultimately, the processes that take place within such scientific communities lead to what is referred to as scientific truth. Data, statistics, machine learning and objectivity are embedded in social discourses, and it is these discourses that, in the last instance, form the bridge to truth.

Openness, Efficiency and Closed Infrastructures

The concept of data spaces that the European Commission is pursuing is not only a technical one; it also implies a political constitution. Data spaces such as GAIA-X do not require centralised management. Such a data space can be operated by a federation that establishes the means to control data integrity and data trustworthiness. The federation that operates the data space is therefore more like the European Union (i.e. a federation of states) than a centralised democracy. And trust is not something that characterises cultural heritage institutions only with regard to data and machine learning models. Such institutions fulfil their mission on the basis of the trust that people place in them, a trust that has grown over decades or centuries and expresses people’s conviction that these renowned and time-honoured institutions make the right decisions – for example, the right choices when acquiring their objects.

The political concept of data spaces thus stands in clear contrast to the hierarchical and opaque structures of big tech companies. With regard to data and machine learning models, a clear movement of centralisation can be observed in the relevant corporations (Alphabet, Meta, Amazon, Microsoft) since the 2010s, particularly with regard to research and development and the provision of infrastructure. A study published in 2022 on the values that are central to machine learning research emphasises two insights. Firstly, the presence of large tech companies in the 100 most-cited studies published at the two most influential machine learning conferences has increased massively: “For example, in 2008/09, 24% of these top cited papers had corporate affiliated authors, and in 2018/19 this statistic more than doubled, to 55%. Moreover, of these corporations connected to influential papers, the presence of “big-tech” firms, such as Google and Microsoft, more than tripled from 21% to 66%.” This means that tech companies are now almost as frequently involved in the most important research as the most important universities. Assessing the consequences of this privatisation of research for the distribution of knowledge production in Western societies would merit studies of its own. Secondly, the study by Birhane et al. highlights a value that is repeatedly emphasised in the 100 examined research articles: efficiency. The praise of efficiency is not neutral in this case, as it favours those institutions that are able to process constantly growing amounts of data and to procure and deploy the necessary resources. In other words, emphasising a technical-sounding value such as efficiency “facilitates and encourages the most powerful actors to scale up their computation to ever higher orders of magnitude, making their models even less accessible to those without resources to use them and decreasing the ability to compete with them.”

False door of Sokarhotep, Old Kingdom, 5th Dynasty. Ägyptisches Museum und Papyrussammlung. CC BY-SA 4.0.
The false door of Sokarhotep symbolises the feigned openness of AI applications provided by big tech

This already addresses the second aspect: the power of disposal over infrastructure. There is no doubt that a “compute divide” already exists between the big tech companies and even elite universities. Research and development in the field of machine learning is currently highly dependent on the infrastructure provided by a small number of actors. This situation also has an impact on the open provision of models. When openness becomes a question of access to resources, scale becomes a problem for openness: truly open AI systems are not possible if the resources needed to build them from scratch and deploy them at scale remain closed, available only to those who have such significant resources at their disposal – and these are almost always corporations. A recently published study on the concentration of power and the political economy of open AI therefore concludes that open source and centralisation are mutually exclusive: “only a few large tech corporations can create and deploy large AI systems at scale, from start to finish – a far cry from the decentralized and modifiable infrastructure that once animated the dream of the free/open source software movement”. A company name like “OpenAI” thus becomes an oxymoron.

Against this backdrop, it becomes clear that the European concept of data spaces represents a counter-movement to the monopolistic structures of tech companies. The openness, data sovereignty and trustworthiness that these data spaces represent will not open up the possibility of building infrastructures that can compete with those of the big tech companies. However, they will make it possible to develop specific models with clearly defined tasks that work more efficiently than the general-purpose applications developed by the tech companies. In this way, the value of efficiency, which is central to the field of machine learning, could be recoded.

Large Language Models and their WEIRD Consequences

In his book “The Weirdest People in the World”, evolutionary psychologist Joseph Henrich focuses on a particular species that he calls “WEIRD people”. The play on words resolves into an acronym: WEIRD stands for “Western, educated, industrialised, rich, democratic”. Henrich wonders how it was possible for a small section of the world’s population, most of whom live in the Western world, to develop a range of very specific skills. He begins with the observation that over the last 500 years the brains of these people have been changed by extensive reading and by the influence of Luther and his imperative to read the Bible independently. In order to characterise these changes and, in particular, to work out how a dynamic of acceleration and innovation as a motor of economic growth developed in Central Europe, he deals with educational institutions, urbanisation, the development of impersonal markets, supra-regional monastic orders, universities, knowledge societies, scholarly correspondence and the formation of new (Protestant) religious groupings. If we wanted to continue Henrich’s study and extend it into the 21st century, we would have to look at the influence of large language models (LLMs) on the human brain and the changes they bring about. Although they have only existed since around 2016 and have only been available to a broad user base since autumn 2022 (ChatGPT), it is already possible to anticipate some – admittedly speculative – consequences of their use.

  1. We will (have to) learn how to deal with misinformation. LLMs are great fabricators, but they cannot distinguish between true and false. As highly efficient text generators, they can produce large amounts of factually incorrect content in no time at all, which feeds the internet, social media and the comment columns of news sites. This can lead to significant distortions in political discourse, for example when elections are coming up – and this will be the case in 2024 in the USA, India, probably the UK and numerous other countries around the world. It therefore comes as no surprise that even the World Economic Forum, in its Global Risks Report this year, lists misinformation and disinformation used for the purpose of deception among the greatest risks with a short-term impact. As LLMs produce texts by predicting the most likely next word, they generate articles that may sound plausible but are often not entirely accurate in terms of content and facts. A WEIRD consequence will therefore be that the human brain will have to develop the discernment needed to identify (and reject) such synthetic content.
  2. We will (have to) sharpen our concept of authenticity. In April 2023, the Berlin-based photographer Boris Eldagsen rejected the prestigious Sony World Photography Award on the grounds that his authentic-looking image of two women was AI-generated. The jury responsible for the award had been unable to distinguish the image, entitled “Pseudomnesia: The Electrician”, from a photo taken with a conventional camera. Our viewing habits and perceptual routines, however, are geared towards viewing photographs as faithful representations of reality. We will undoubtedly have to adapt our concept of authenticity here, as multimodal LLMs have also become extremely powerful in the area of moving images. In January 2024, a study revealed that over 100 deepfake videos of Rishi Sunak had been distributed as adverts on Facebook in recent weeks. Both examples demonstrate the manipulability of our perception, cause irritation, disturbance and scepticism, and point to the fact that we need to relearn how to deal with AI-generated visual content.
  3. We will (have to) come to terms with the fascination of visual worlds. Generative pretrained transformers (GPTs) will soon be able not only to generate texts, but also to create complete three-dimensional visual worlds. This is exactly what Mark Zuckerberg’s vision of the metaverse is aimed at: creating virtual worlds that are so overwhelmingly fascinating that users can no longer detach themselves from them – in other words, visual worlds that are highly addictive. The attraction of virtual realities, as the gaming industry has known them up to now, is thus multiplied. In order not to become completely dependent on these worlds and not to lose touch with reality, we will have to adapt our cognitive abilities – certainly a WEIRD competence in Henrich’s sense.

These three examples outline only the most likely consequences that the widespread use of LLMs will have on our brains. Many others are conceivable, such as the atrophy of the ability to conceptualise complex texts (also a WEIRD ability). In terms of the plasticity of our brains, the arrival of LLMs and their output is thus in line with historical upheavals such as the invention of printing and the introduction of electronic mass media, with all their consequences for cognitive organisation and social coexistence. It is no exaggeration to say that the concept of representation needs to be redefined. So far, humanity has coped quite well with such epochal upheavals. We will see how the WEIRD consequences play out in practice.

Power Hungry Magic

“Any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke observed, and it is part of the magic of new technologies that their downsides are systematically concealed. This is also the case with the energy consumption of large language models (LLMs): just as the schnitzel on the consumer’s plate makes one forget the realities of factory farming, so it is with the marvels of artificial intelligence. Information about the computing power required to create products such as ChatGPT and about the masses of data used is not provided – either to avoid making data protection and copyright issues too obvious, or to avoid having to quantify the energy consumption and CO2 emissions involved in training and operating these models. The reputable newspaper Die Zeit estimated in March 2023: “For the operation of ChatGPT, […] costs of 100,000 to 700,000 dollars a day are currently incurred” and noted “1,287 gigawatt hours of electricity” or “emissions of an estimated 502 tonnes of CO2” for the training of GPT-3 (Art. “Hidden energy”, in: Die Zeit No. 14, 30.03.2023, p. 52). Against this backdrop, it comes as no surprise that, according to the International Energy Agency, the electricity consumption of the big tech companies Amazon, Microsoft, Google and Meta doubled to 72 TWh between 2017 and 2021; these four companies are also the world’s largest providers of commercially available cloud computing capacity.

Recently, Sasha Luccioni, Yacine Jernite and Emma Strubell presented the first systematic study on the energy consumption and CO2 emissions of various machine learning models during the inference phase. Inference here means the operation of the models, i.e. the period of deployment after training and fine-tuning. Inference accounts for around 80 to 90 per cent of the costs of machine learning; on a cloud computing platform such as Amazon Web Services (AWS), it is around 90 per cent according to the operator. The study by Luccioni et al. emphasises the differences between various machine learning applications: the power and CO2 intensity is massively lower for text-based applications than for image-based tasks, and likewise massively lower for discriminative tasks than for generative ones, including generative pretrained transformers (GPTs). The differences between the various models are considerable: “For comparison, charging the average smartphone requires 0.012 kWh of energy which means that the most efficient text generation model uses as much energy as 16% of a full smartphone charge for 1,000 inferences, whereas the least efficient image generation model uses as much energy as 950 smartphone charges (11.49 kWh), or nearly 1 charge per image generation.” The larger the model, the sooner the inference phase consumes as much electricity, and emits as much CO2, as the training phase did.
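The quoted smartphone comparison can be checked with a few lines of arithmetic. The following sketch simply recomputes the ratios from the figures cited above; the numbers come from the quotation, not from an independent measurement.

```python
# Back-of-the-envelope check of the ratios quoted from Luccioni et al.
SMARTPHONE_CHARGE_KWH = 0.012      # average smartphone charge, as cited
IMAGE_GEN_KWH_PER_1000 = 11.49     # least efficient image model, per 1,000 images

charges_per_1000_images = IMAGE_GEN_KWH_PER_1000 / SMARTPHONE_CHARGE_KWH
print(f"{charges_per_1000_images:.0f} smartphone charges per 1,000 images")   # ~958
print(f"{charges_per_1000_images / 1000:.2f} charges per single image")       # ~0.96

# The most efficient text model: 16% of one charge per 1,000 inferences.
text_gen_kwh_per_1000 = 0.16 * SMARTPHONE_CHARGE_KWH
print(f"{text_gen_kwh_per_1000:.5f} kWh per 1,000 text generations")          # 0.00192
```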

Since ‘general purpose’ applications consume more energy for the same task than models that have been trained for a specific purpose, Luccioni et al. point out several trade-offs. Firstly, the trade-off between model size and power consumption: the benefits of multi-purpose models must be weighed against their power costs and CO2 emissions. Secondly, the trade-off between accuracy and electricity consumption across different models: task-specific models achieve high accuracy at comparatively low power consumption, whereas multi-purpose models can fulfil many different tasks but do so with lower accuracy and higher electricity consumption. According to the authors, these empirical findings call into question whether it is really necessary to operate multi-purpose models such as Bard and Bing at all; they “do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.”

Unlike the famous Club of Rome report more than 50 years ago, the power hunger of large general-purpose models does not bring the “limits to growth” to the attention of the leading entrepreneurs and investors of Western big tech companies. On the contrary, CEOs such as Jeff Bezos, whose empire also includes the largest cloud computing platform, AWS, fear stagnation: “We will have to stop growing, which I think is a very bad future.” Visions such as the metaverse are extremely costly in terms of resource consumption and emissions, and it is fair to ask whether AI applications will really be available to all of humanity in the future or only to those companies or individuals who can afford them. None of this is even remotely sustainable. Given the growing power consumption of Western big tech companies and the fact that the core infrastructure for the development of AI products is already centralised among those few players, it remains unclear where the development of ‘magical’ AI applications will lead. The scholar Kate Crawford has given her own answer in her book “Atlas of AI”: into space, because that is where the resources are that these corporations need.

Human-Machine-Cognition

Humans search for themselves in non-human creatures and inanimate artefacts. Apes, our “next of kin”, and dogs, our “most faithful companions”, are good examples of the former; robots are good examples of the latter. According to a common hypothesis, a human-like design of robots’ bodies and a humanising linguistic framing of their capabilities support the anthropomorphisation of these machines and, as a consequence, the development of empathetic behaviour towards robots. The tendency to anthropomorphise varies from person to person; there are “stable individual differences in the tendency to attribute human-like attributes to nonhuman agents”.

Large language models (LLMs) are not (yet) associated with human-like body shapes. However, this does not mean that they are exempt from the human tendency to anthropomorphise. Even a well-formulated sentence can lead us to wrongly assume that it was spoken by a rational agent. Large language models are now excellently capable of reproducing human language: they have been trained on linguistic rules and patterns and have an excellent command of them. However, knowledge of the statistical regularities of language does not amount to “understanding”. The ability to use language appropriately in a social context is also still incompletely developed in LLMs; they lack the necessary world knowledge, sensory access to the world and commonsense reasoning. That we nevertheless tend to read the text produced by generative pretrained transformers (GPTs) as human utterances is due, on the one hand, to the fact that these language models have been trained on very large volumes of 21st-century text and can therefore perfectly replicate our contemporary discourse. If the way in which meaning is produced through language corresponds to our everyday habits, it can come as no surprise that we attribute “intelligence”, “intentionality” or even “identity” to the producer of a well-crafted text. In this respect, LLMs confirm the structuralist theories of the second half of the 20th century, according to which language is a system that defines and limits the framework of what can be articulated and thus ultimately thought. And in this respect, LLMs also seem to confirm Roland Barthes’ thesis of the “death of the author”: the infinite recombination of the available word material and the prediction of the most probable words and sentences seem to be enough for us to recognise ourselves in the text output.

On the other hand, the specific design of chatbots supports anthropomorphisation. ChatGPT, for example, has been trained on tens of thousands of question-answer pairs. Instruction fine-tuning ensures that the model generates text sequences in a specific format. The LLM interprets the prompt as an instruction, distinguishes the input of the interlocutor from the text it produces itself, and draws conclusions about the human participants. On the one hand, this means that the language model is able to adapt the generated text to its human counterpart and to imitate sociolects; on the other hand, it creates in humans the cognitive illusion of a dialogue. The interface of apps such as ChatGPT further supports this illusion; it is designed like all the other interfaces we use for human conversations. We humans then follow our habits, add to the dialogue with the chatbot the social context that is characteristic of a conversation, and assume intentionality on the other side. Finally, ChatGPT was trained as a fictional character that provides answers in the first person. The language model therefore produces statements about itself, for example about its ethical and moral behaviour, its performance, privacy and the training data used. If a user asks for inappropriate output, the language model politely declines. These statements are best understood as an echo of the training process – as what OpenAI would like us to believe about this technology. The dialogue form and the fictional character reporting in the first person are the only means by which OpenAI can control the output of the language model.

All of this can be summarised as “anthropomorphism by design”. It is therefore no wonder that we humans tend to ascribe human characteristics to a disembodied language model. However, while we are learning how to use such chatbots, we must not succumb to the illusion that we are dealing with a human interlocutor. Empathetic statements or emotions uttered by the bot are simulations that can become extremely problematic if we, for example, mistake the bot for a therapist. The assumption that a language model could be suitable for making decisions and could therefore take on the role of lawyers, doctors or teachers is also misleading: in the end, it is still humans who bear responsibility for such decisions. We must therefore not be tricked by an anthropomorphising design. The impression that we have anything other than a machine as our counterpart is deceptive: there is no one there.

On the Tyranny of the Majority

Large language models (LLMs) predict the statistically most probable word when they generate texts. That the predicted word or sentence is the most probable one says nothing about whether it is true or false; at the same time, predicting probabilities favours the majority opinion. If one word combination appears significantly more frequently than another in the training data set, it is favoured by the LLM; and if annotators assign one label more frequently than another, the more frequently assigned label is favoured and the minority opinion is suppressed (a small illustration follows below). This “tyranny of the majority” has consequences for at least two important areas of society: for science and for culture.
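How majority aggregation suppresses minority annotations can be shown with a toy example. The sketch below is purely illustrative (the items and labels are hypothetical); it mimics the common practice of deriving a single ‘gold’ label by majority vote, whereby the minority judgement disappears from the data before any model is trained.

```python
# Toy illustration (hypothetical annotations): majority voting over labels
# erases the minority view before a model ever sees the data.
from collections import Counter

annotations = {
    "comment_1": ["neutral", "neutral", "offensive"],   # 2:1 vote
    "comment_2": ["offensive", "neutral", "neutral"],   # 2:1 vote
}

gold_labels = {
    item: Counter(votes).most_common(1)[0][0]
    for item, votes in annotations.items()
}
print(gold_labels)  # {'comment_1': 'neutral', 'comment_2': 'neutral'}
# The 'offensive' judgements, each held by one annotator, are gone entirely.
```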

If we consider how Thomas Kuhn conceptualises the “structure of scientific revolutions” and Pierre Bourdieu the renewal of cultural fields, it becomes evident that every new scientific paradigm and every artistic avant-garde movement represents a minority opinion, at least initially. There is a dominant majority opinion, which Kuhn describes as paradigmatic “normal science” and Bourdieu as the “orthodox” conception of art. These social groups form the dominant pole in their respective fields and are challenged in a competition by a “revolutionary” (Kuhn) or “heretical” (Bourdieu) position. The representatives of the dominant opinion often react negatively to this opposition: “Normal science, for example, often suppresses fundamental novelties because they are necessarily subversive of its basic commitments.” (Kuhn, Structure, p.5). What follows is, sociologically speaking, a struggle for recognition, a fight for the rejection of an older scientific or artistic paradigm and the introduction of a new one.

The trial of strength between the different groups of scientists or artists can lead to different results. The new paradigm may completely replace the old one and assume the dominant position in the field; this happened, for example, in the study of syphilis when the pathogen was identified for the first time at the beginning of the 20th century. Another possibility is that two different scientific paradigms (or schools of art) coexist, such as Newtonian and Einsteinian mechanics; the decisive factor here is that the two rest on different, mutually exclusive frames of reference (just as scientists often first have to develop a new ‘school of seeing’ and collect new data). Yet another possibility is that two different paradigms exist side by side without the majority ratios changing. This is the case, for example, with the different interpretations of quantum mechanics: the stochastic or Copenhagen interpretation forms the majority opinion, while the deterministic or Bohmian theory represents a minority position. In the field of art, one can think of the overcoming of tonality and the development of the twelve-tone technique by avant-gardists such as Arnold Schönberg and Alban Berg. Although this technique was later taken up, it did not develop into the dominant method and never really became suitable for the masses (while tonality is still decisive for the majority of consumers today). Max Planck once commented ironically on the longevity of outdated scientific paradigms and their representatives with the words: “Science progresses one funeral at a time.”

The way in which Kuhn and Bourdieu conceptualise the processes of renewal in the fields of science and culture focuses primarily on the social processes associated with scientific or artistic revolutions. With regard to LLMs and the hopes associated with artificial general intelligence (AGI), this is instructive: due to its design, such ‘intelligence’ tends to reproduce the majority opinion and thus the dominant paradigm (in science), or it tends towards the commonplace, the cliché, the banal and the inauthentic (in art). This does not mean that intelligent machines cannot be used to create new paradigms. But they will not do so ‘by themselves’. Rather, it is clear that we have to view the seemingly overpowering AIs in the wider context of a socio-technical system in which humans still play a central role as agents – even if they are in the minority.

Human-Machine-Creativity

Language models that generate texts on the basis of probabilities are best approached with solid scepticism as to their factual accuracy – and with a little humour. Jack Krawczyk, who is responsible for the development of the chatbot “Bard” at Google, openly admitted in March 2023: “Bard and ChatGPT are large language models, not knowledge models. They are great at generating human-sounding text, they are not good at ensuring their text is fact-based.” Calling a language model “bard” with a wry wink hits the nail on the head: bards write poetry, tell stories and do not necessarily stick to the truth, as we have known since Plato.

Creating texts, especially literary texts, was previously the preserve of humans. Large language models (LLMs), however, are surprisingly good at identifying and replicating literary styles and genres. So how should we imagine literary text production from now on? Terms such as “consciousness”, “memory”, “intentionality” and “creativity” are surprisingly poorly defined, both for humans and for machines. With regard to the latter, the British cognitive scientist Margaret A. Boden examined the differences between human and machine creativity in her book “The Creative Mind” – emphasising that machines only appear to be creative to a certain degree. She distinguishes between three forms of creativity: a) making unfamiliar combinations of familiar ideas; b) explorative creativity; and c) transformative creativity.

Producing unknown combinations from known ideas is certainly something LLMs are good at, because that is how they are built: calculating the most likely recombinations of the available data, following the patterns present in that data. It should therefore no longer be a great challenge for an LLM to fabricate a short story in 99 different styles and thus replicate Raymond Queneau’s famous “Exercices de style”. Literary variations such as permutations, rhyme forms, jargons, narrative perspectives, sociolects, etc. should be producible with a single prompt. The phrase “a single prompt” reveals the vagueness of the term “intention”: a human must enter the prompt and thus act “intentionally”; the machine takes care of everything else.

The second form of creativity, according to Boden, explores conceptual spaces, which in the field of literature we can imagine as established genres. Genres follow rules that outline the space in which the literary action takes place; the sociologist Pierre Bourdieu described them as the “rules of art”. Not everything is possible in every genre: while in crime fiction the dead do not come back to life or move around as living corpses, this is certainly possible in fantasy or horror literature. LLMs are able to identify such spaces of possibility and replicate the patterns that characterise them. Especially when the underlying data contains many examples of literary genres such as historical novels, fantasy and romance novels, along with their characteristic styles and topoi, LLMs can reliably produce recombinations and thus explore the conceptual space. Since these spaces offer many possibilities, not all of which are equally attractive to human readers, we can think of these combinatorial explorations as human-machine collaborations: a human develops the sketch of a novel and lets the machine formulate the outlined plot chapter by chapter. Such collaborations can be criticised from an economic rather than an aesthetic perspective: in order to know the current space of possibilities, LLMs must also have access to material that is under copyright. In the case of systems like ChatGPT, whose data basis is not disclosed, this amounts to a privatisation of culture that was once public. And, to use an old argument: here, human labour is being replaced by machinery that enables the relevant companies to skim off the generated surplus.

The third form of creativity described by Margaret Boden aims to transform the conceptual space. Here, the rules that describe this space are broken and new ones are established. We can think, for example, of Marcel Duchamp’s urinal entitled “Fountain”, Picasso’s first cubist painting “Les Demoiselles d’Avignon” or Italo Calvino’s “Le città invisibili”. However, in order to redesign the conceptual space, one first has to know it and be able to name the rules that characterise it – only then can such a transformative work be realised in collaboration with a machine. An LLM cannot achieve this on its own, as such models do not reflect on their own activity, lack generalisable world knowledge, and their heuristics are geared towards identifying patterns, not towards creating new ones. This is where human and machine creativity part ways: human creativity has knowledge of the world and a (possibly intuitive) knowledge of the rules of a conceptual space; in a movement of departure from known concepts, new solutions are found, radical ideas are developed and new rules are established. Transformative creativity enables humans to create new works in collaboration with a machine; the intention to leave the known space of possibilities, however, seems (still) to be reserved for humans.

It’s the statistics, stupid

“It’s the statistics, stupid”, one could say when it comes to dealing with generative pretrained transformers (GPTs). Yet, only one year after the presentation of ChatGPT, this is something we all still have to learn. Statistical correlations are key to understanding how stochastic prediction models work and what they are capable of.

Put simply, machine learning consists of showing a machine data on the basis of which it learns or memorises what belongs to what. This data is called the training data set. Once the machine has learnt the correlations, a test data set is presented to the model, i.e. data that it has not yet seen. The result can be used to measure how well the machine has learnt the correlations. The basic principle is that probability models are trained on as much representative initial data (i.e. examples) as possible in order to then be applied to further, unseen data; a minimal sketch of this principle follows below. The quality of such a model therefore always depends on the volume, variety and quality of the initial data used for training.
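As a minimal, hedged sketch of this train/test principle (using scikit-learn on a bundled toy dataset, chosen purely for illustration), the following lines show the split into training and test data, the learning of correlations, and the evaluation on data the model has never seen:

```python
# Minimal sketch of the train/test principle described above.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Keep part of the data aside as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)                      # learning the correlations
print("accuracy on unseen data:", model.score(X_test, y_test))
```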

Large language models (LLMs) are trained to write texts on specific topics, to provide answers to questions and to create the illusion of a dialogue. The machine is shown a large number of texts in which individual words are “masked” or hidden, for example: “It’s the [mask], stupid”. In response to the question: “What is this election about?”, the model then makes a prediction as to which word—based on the training data—would most likely be in the place of [mask], in this case “economy”. In principle, “deficit”, “money” or “statistics” could just as well be used here, but “economy” is by far the most common term in the training data and therefore the most likely word. The language model combines words that often appear in similar contexts in the training data set. The same applies to whole sentences or even longer texts.
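This masked-word prediction can be sketched with an off-the-shelf masked language model, here via Hugging Face's fill-mask pipeline. The model choice is illustrative, and whether the top-ranked filler is actually "economy" depends entirely on that model's training data.

```python
# Sketch: a masked language model ranks the statistically most probable
# fillers for the hidden word. Model choice is illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("It's the [MASK], stupid."):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}")
```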

However, the fact that LLMs predict probabilities has serious consequences. That a sentence predicted by a model is probable says nothing about whether it is true or false. The generated texts may also contain misinformation such as outdated or false statements or outright fictions. Language models such as ChatGPT do not learn patterns that can be used to evaluate the truth of a statement. It is therefore the task of the people using the chatbot to check the credibility or truthfulness of a statement and to contextualise it. We should all learn how to do this, just as we once learnt to check the reliability of a source presented as the result of a Google search. For some areas of life, the distinction between true and false is central, for example in science. A generative model that is able to produce scientific texts but cannot distinguish between true and false is therefore bound to fail – as was the case with the “Galactica” model presented by Meta, which was trained on 48 million scientific articles. Consequently, such a model will also raise questions about good scientific practice. Since science is essentially a system of references, the fact that generative models such as ChatGPT fabricate references (i.e. generate a probable sequence of words) when in doubt is a real problem. It can therefore come as no surprise that the word ‘hallucinate’ was named Word of the Year 2023 by the Cambridge Dictionary.

Furthermore, the truthfulness of facts depends on the context. This may sound strange at first, but even the banal question “What is the capital of the Federal Republic of Germany?” shows that the answer can vary: just over 30 years ago, “Bonn on the Rhine” would have been the correct answer. And the answer to the question “What is this election about?” would probably be different today than it was 30 years ago (spoiler suggestion: oligarchy vs. democracy). With regard to science, it becomes even more complex: the progress of scientific knowledge means that statements that were considered true and factual just a few decades ago are now considered outdated. Programming code, too, requires people to check what a generative model produces. This is the reason why one of the most important platforms for software developers, Stack Overflow, still does not allow answers generated by such models, as there is a realistic risk that they contain false or misleading information or malicious code. Large language models cannot verify the truth of a statement because, unlike humans, they do not have world knowledge and therefore cannot compare their output with the relevant context.

Beyond science and software development, a serious risk of language models in general is the creation of misinformation. If such models are used to generate (factually incorrect) content that is disseminated via social media or fills the comment fields of news sites, this can have serious consequences—they can increase polarisation and mistrust within a society or undermine shared basic convictions. This can have significant political consequences: In 2024, for example, new governments will be elected in the USA and India, and we can assume that these election campaigns will be largely decided by the content provided on social media. Is it the stupid statistics?

On the Use of ChatGPT in Cultural Heritage Institutions

Since the release of the ChatGPT dialogue system in November 2022, the societal debate about artificial intelligence (AI) has gained significant momentum and has also reached cultural heritage institutions (such as libraries, archives and museums). The main challenge is to assess how powerful such large language models (LLMs) are in general, and generative pre-trained transformers (GPTs) in particular. For the cultural heritage sector, the ChatGPT chatbot prototype reveals a whole range of possible uses: producing text summaries or descriptions of artworks, generating metadata, writing computer code for simple tasks, assisting with subject indexing and keywording, or helping users find resources on the websites of cultural heritage institutions.

Undoubtedly, ChatGPT’s strengths lie in the generation of text and associated tasks. As “stochastic parrots”, as these large language models were called in a much-discussed 2021 paper, they can predict on a stochastic basis what the next words of a snippet of text will look like. ChatGPT in particular has been trained – as a text-based dialogue system – to provide an answer in any case. This property of the chatbot points directly to one of the central weaknesses of the model: in case of doubt, ChatGPT provides untrue statements in order to keep the dialogue going. Since large language models are, after all, only applications of artificial intelligence and have no knowledge of the world, they cannot per se distinguish between fact and fiction, social construction and untruth. The fact that ChatGPT “hallucinates” (as the common anthropomorphising term goes) when in doubt and, for example, invents literature references naturally damages the reliability of the system – and it points to the great strength of libraries in providing authoritative evidence.

On the other hand, a strength of such systems is that they reproduce discourses excellently and are therefore able to classify individual texts or larger text corpora and to describe their content remarkably well. This shows great potential, especially for libraries. Up to now, digital assistants that support the indexing of books have at best worked with statistical methods such as tf-idf, or with deep learning. Such approaches could be complemented through the use of topic modeling. The latter method generates stochastically modelled collections of words that describe the content of a work or the topics it deals with. The challenge for users so far has been to interpret such a collection of words and assign a coherent label to it – and this is exactly what ChatGPT does excellently, as several researchers have confirmed (a sketch of this workflow follows below). Since this massively improves and facilitates the labelling of texts, it is certainly one of the most probable use cases for AI in libraries, and exactly the field on which sub-project 3, “AI-supported content analysis and subject indexing”, of the project “Human.Machine.Culture” focuses. By contrast, simple programming tasks such as creating a bibliographic record in a specific format or transforming a record from MARC.xml to JSON still leave room for improvement; ChatGPT does not always perform such tasks reliably, as a recent experiment showed.
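The topic-modelling-plus-labelling workflow described above can be sketched in a few lines using scikit-learn's LDA on a tiny invented corpus. The corpus, the number of topics and the labelling prompt are illustrative assumptions, not the actual pipeline of sub-project 3.

```python
# Sketch: topic modelling yields word lists; a language model would then be
# asked to assign a human-readable label to each list.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "medieval manuscripts and illuminated codices in monastic libraries",
    "papyrus fragments from ancient egypt and hieroglyphic inscriptions",
    "training neural networks on large text corpora for classification",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[-5:][::-1]]
    # These word lists would then be handed to an LLM with a prompt such as
    # "Suggest a concise subject heading for these keywords: ..."
    print(f"topic {i}: {', '.join(top_words)}")
```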

ChatGPT, as one of the most powerful text-based AI applications currently available, underlines the potential benefits of such models. At the same time, however, it also highlights the risks associated with their use. So far, only U.S. big tech companies have been able to train such powerful models, make them accessible and later develop models optimised through reinforcement learning for specific tasks – with the clear goal of monetisation. In addition, generative AI systems bring with them a number of ethical issues, as they require large masses of text that have so far been taken from the Internet – a place where not everyone interacts politely and observes etiquette. A recent study has underlined, for example, that large language models reproduce stereotypes by associating the terms “Muslims” and “violence”. Moreover, toxic content in the language models has to be labelled as such, an operation carried out by underpaid workers; this again underlines the ethical dubiousness of the process of establishing such models.

Finally, it has to be underlined that these models have been trained almost exclusively on 21st-century textual material available on the Internet. By contrast, sub-project 4, “Data provision and curation for AI”, of the project “Human.Machine.Culture” concentrates on the provision of curated and historical data from libraries for AI applications. Ultimately, the deployment of large language models points to very fundamental questions: what role the cultural heritage of all humanity should play in the future, what influence cultural heritage institutions like libraries, archives and museums may have on the creation of such models, and what influence the texts generated by large language models will have on our contemporary culture as such.