“It’s the statistics, stupid”, one could say when it comes to dealing with generative pretrained transformers (GPTs). Yet, barely a year after ChatGPT was first presented, this is a lesson we all still have to learn. Statistical correlations are key to understanding how stochastic prediction models work and what they are capable of.
Put in simple terms, machine learning consists of showing a machine data from which it learns, or memorises, what belongs to what. This data is called the training data set. Once the machine has learnt the correlations, it is presented with a test data set, i.e. data it has not yet seen. The result measures how well the machine has learnt the correlations. The basic principle is that probability models are trained on as much representative initial data (i.e. examples) as possible, so that they can then be applied to further, unseen data. The quality of such a model therefore always depends on how large, how varied and of what quality the training data is.
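The train-then-test principle can be sketched in a few lines. The following is a deliberately toy illustration with invented data (the word pairs and the "most frequent continuation" model are assumptions for the example, not how a real LLM is built): the machine memorises which continuation most often follows a word in the training set, and its accuracy is then measured on held-out test pairs it has not seen during training.

```python
from collections import Counter, defaultdict

# Invented toy data: pairs of (word, following word).
train_pairs = [
    ("hello", "world"), ("hello", "world"), ("hello", "there"),
    ("good", "morning"), ("good", "morning"), ("good", "night"),
]
test_pairs = [("hello", "world"), ("good", "morning"), ("good", "night")]

# "Training": count which continuation follows each word most often.
counts = defaultdict(Counter)
for first, second in train_pairs:
    counts[first][second] += 1
model = {first: c.most_common(1)[0][0] for first, c in counts.items()}

# "Testing": how often does the most probable continuation match unseen data?
correct = sum(1 for first, second in test_pairs if model.get(first) == second)
accuracy = correct / len(test_pairs)
print(model)     # {'hello': 'world', 'good': 'morning'}
print(accuracy)  # 0.666... - the model is often right, but not always
```

The point of the sketch is the last line: even a model that has learnt the training correlations perfectly is only *probably* right on new data, never guaranteed to be.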
Large language models (LLMs) are trained to write texts on specific topics, to provide answers to questions and to create the illusion of a dialogue. The machine is shown a large number of texts in which individual words are “masked” or hidden, for example: “It’s the [mask], stupid”. In response to the question: “What is this election about?”, the model then makes a prediction as to which word—based on the training data—would most likely be in the place of [mask], in this case “economy”. In principle, “deficit”, “money” or “statistics” could just as well be used here, but “economy” is by far the most common term in the training data and therefore the most likely word. The language model combines words that often appear in similar contexts in the training data set. The same applies to whole sentences or even longer texts.
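The masked-word example above can be made concrete with a minimal frequency count. The mini "corpus" below is invented for illustration, and real language models use neural networks rather than raw counts, but the underlying logic is the same: the word predicted for [mask] is simply the most probable one given the training data.

```python
from collections import Counter

# Invented toy corpus: the completions of "It's the [mask], stupid"
# that the model has seen during training.
training_fills = ["economy", "economy", "economy",
                  "deficit", "money", "statistics"]

# Turn raw counts into probabilities, as a language model effectively does.
counts = Counter(training_fills)
total = sum(counts.values())
probs = {word: n / total for word, n in counts.items()}

# The prediction for [mask] is the most probable word.
prediction = counts.most_common(1)[0][0]
print(prediction)               # economy
print(probs[prediction])        # 0.5
```

Note that “deficit”, “money” and “statistics” all receive non-zero probability here; “economy” wins only because it is the most frequent, not because it is true.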
However, the fact that LLMs predict probabilities has serious consequences. That a sentence predicted by a model is probable, for example, says nothing about whether it is true or false. The generated texts may therefore contain misinformation: outdated statements, false claims or pure fabrications. Language models such as ChatGPT do not learn patterns that could be used to evaluate the truth of a statement. It is therefore the task of the people using the chatbot to check the credibility or truthfulness of a statement and to contextualise it. We should all learn how to do this, just as we learnt “back then” to check the reliability of a source presented as the result of a Google search. In some areas of life, the distinction between true and false is central, for example in science. A generative model that is able to produce scientific texts but cannot distinguish between true and false is therefore bound to fail, as was the case with the “Galactica” model presented by Meta, which was trained on 48 million scientific articles. Consequently, such a model also raises questions about good scientific practice. Since science is essentially a system of references, the fact that generative models such as ChatGPT “fable” references when in doubt (i.e. generate a merely probable sequence of words) is a real problem. It can therefore come as no surprise that the Cambridge Dictionary named “hallucinate” its Word of the Year 2023.
Furthermore, the truthfulness of facts depends on the context. This may sound strange at first, but even the banal question “What is the capital of the Federal Republic of Germany?” shows that the answer can vary: just over 30 years ago, “Bonn on the Rhine” would have been the correct answer. And the answer to the question “What is this election about?” would probably be different today than it was 30 years ago (spoiler suggestion: oligarchy vs. democracy). With regard to science, it becomes even more complex: the progress of scientific knowledge means that statements that were considered true and factual just a few decades ago are now considered outdated. Programming code, too, requires people to check what a generative model has produced. This is the reason why Stack Overflow, one of the most important platforms for software developers, still does not allow answers generated by such models: there is a realistic risk that they contain false or misleading information, or even malicious code. Large language models cannot verify the truth of a statement because, unlike humans, they have no world knowledge and therefore cannot compare their output with the relevant context.
Beyond science and software development, a serious risk of language models in general is the creation of misinformation. If such models are used to generate (factually incorrect) content that is disseminated via social media or fills the comment sections of news sites, this can have serious consequences: they can increase polarisation and mistrust within a society or undermine shared basic convictions. The political stakes are considerable: in 2024, for example, new governments will be elected in the USA and India, and we can assume that these election campaigns will be largely decided by the content circulating on social media. Is it the stupid statistics?