
On Objectivity – and the Bridge to Truth

Statistics are held in high regard. Although the saying goes, “do not trust any statistics you did not fake yourself”, statistics are often regarded as a prime example of objectivity, grounded as they are in large datasets. This view is taken to the extreme when it comes to machine learning: machine learning models are statistical learners. A recently published research article criticizes this view: “the mythology surrounding ML presents it—and justifies its usage in said contexts over the status quo of human decision-making—as paradigmatically objective in the sense of being free from the influence of human values” (Andrews et al. 2024).

The fact that machine learning is seen as an extreme case of objectivity has its origins in the 19th century, when the foundations of our current understanding of objectivity were laid. Human (and fallible) subjectivity was contrasted with mechanical objectivity. At that time, machines were considered free from wilful intervention, which was seen as the most dangerous aspect of subjectivity (Daston / Galison 2007). To this day, machines – be they cameras, sensors or electronic devices, or even the data they produce – are emblematic of the elimination of human agency and embody objectivity without subjectivity. These perceptions persist, which is why it needs explaining how common sense can continue to attribute objectivity and impartiality to data, statistics and machine learning.

The 19th-century debate has its revenant today in the discussion about bias. The fact that every dataset contains statistical distortions is obviously not compatible with the attribution of objectivity, which is supposed to be inherent in large datasets in particular. From a statistical point of view, large sample sizes make even minimal differences statistically significant, so the effect size becomes the more important measure. Moreover, “large” does not mean “all”; one must be aware of the universe actually covered by the data. Statistical inference, i.e. drawing conclusions from data about the population as a whole, cannot simply be applied, because such datasets are not established to ensure representativeness (Kitchin 2019). A recent article states with regard to biases: “Data bias has been defined as ‘a systematic distortion in the data’ that can be measured by ‘contrasting a working data sample with reference samples drawn from different sources or contexts.’ This definition encodes an important premise: that there is an absolute truth value in data and that bias is just a ‘distortion’ from that value. This key premise broadly motivates approaches to ‘debias’ data and ML systems” (Miceli et al. 2022). What sounds like objectivity and ‘absolute truth value’ because it is based on large datasets, statistics and machine learning models is not necessarily correct: if the model is a poor representation of reality, the conclusions drawn from its results may be wrong. This is also the reason why Cathy O’Neil in 2016 described an algorithm as “an opinion formalized in code” – it does not simply offer objectivity, but works towards the purposes and goals for which it was written.
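
A minimal sketch of the point about sample size and significance, using synthetic data (the group size of one million and the 0.02-standard-deviation difference are illustrative assumptions, not figures from the cited literature): with enough observations, even a negligible difference comes out as “highly significant”, which is why the effect size is the more informative number.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups that differ by a negligible amount (0.02 standard deviations).
n = 1_000_000
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.02, scale=1.0, size=n)

t, p = stats.ttest_ind(a, b)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p-value:   {p:.2e}")        # far below 0.05: 'highly significant'
print(f"Cohen's d: {cohens_d:.3f}")  # ~0.02: a practically irrelevant effect
```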

Relief fragment with a depiction of rowers, Hatshepsut (Queen, Ancient Egypt, 18th Dynasty). Staatliche Museen zu Berlin, Egyptian Museum and Papyrus Collection. Public Domain Mark 1.0
A historical visualisation of scientists communicating with each other and harmonising their views, in the sense of the community standing above the individual?

That scientists – and the machine learning community in particular – still adhere to the concept of objectivity and to the objective nature of scientific knowledge is owed to the fact that such knowledge is socially constructed: it is partly derived from collective beliefs held by scientific communities (Fleck 1935/1980). Beyond the activity of the individual researcher, the embedding of research results within a broader scientific discourse shows that scientific research is a collective activity. Much of what is termed ‘science’ rests on social practices and procedures of adjudication. As the historian of science Naomi Oreskes noted in 2019, the heterogeneity of the scientific community paradoxically supports the strength of the achieved consensus: “Objectivity is likely to be maximized when […] the community is sufficiently diverse that a broad range of views can be developed, heard, and appropriately considered.” This was obviously also clear to Miceli et al. when they took a position in the debate on biases: “data never represents an absolute truth. Data, just like truth, is the product of subjective and asymmetrical social relations.” Ultimately, the processes that take place within such scientific communities lead to what is referred to as scientific truth. Data, statistics, machine learning and objectivity are embedded in social discourses, and it is the latter, in the last instance, that form the bridge to truth.

Openness, Efficiency and Closed Infrastructures

The concept of data spaces that the European Commission is pursuing is not only a technical one; it also implies a political constitution. Data spaces such as GAIA-X do not require centralised management. Such a data space can be operated by a federation that establishes the means to control data integrity and data trustworthiness. The federation that operates the data space is therefore more like the European Union (i.e. a federation of states) than a centralised democracy. And trust is not only something that characterises cultural heritage institutions in terms of data and machine learning models. Such institutions fulfil their mission on the basis of the trust that people place in them – a trust that has grown over decades or centuries and expresses people’s conviction that these renowned and time-honoured institutions make the right decisions and, for example, the right choices when acquiring their objects.

The political concept of data spaces thus stands in clear contrast to the hierarchical and opaque structures of big tech companies. With regard to data and machine learning models, a clear movement towards centralisation can be observed in the relevant corporations (Alphabet, Meta, Amazon, Microsoft) since the 2010s, particularly in research and development and in the provision of infrastructure. A study published in 2022 on the values that are central to machine learning research highlights two insights. Firstly, the presence of large tech companies in the 100 most-cited studies presented at the two most influential machine learning conferences has increased massively: “For example, in 2008/09, 24% of these top cited papers had corporate affiliated authors, and in 2018/19 this statistic more than doubled, to 55%. Moreover, of these corporations connected to influential papers, the presence of “big-tech” firms, such as Google and Microsoft, more than tripled from 21% to 66%.” This means that tech companies are now involved in the most important research almost as frequently as the most important universities. What this privatisation of research means for the distribution of knowledge production in Western societies would merit studies of its own. Secondly, the study by Birhane et al. highlights a value that is invoked again and again in the 100 examined research articles: efficiency. The praise of efficiency is not neutral in this case, as it favours those institutions that are able to process constantly growing amounts of data and to procure and deploy the necessary resources. In other words, emphasising a technical-sounding value such as efficiency “facilitates and encourages the most powerful actors to scale up their computation to ever higher orders of magnitude, making their models even less accessible to those without resources to use them and decreasing the ability to compete with them.”

Feigned door of Sokarhotep, Old Kingdom, 5th Dynasty. Ägyptisches Museum und Papyrussammlung. CC BY-SA 4.0.
The feigned door of Sokarhotep symbolises the feigned openness of AI applications provided by big tech

This already touches on the second aspect: control over infrastructure. There is no doubt that a “compute divide” already exists between the big tech companies and even elite universities. Research and development in the field of machine learning is currently highly dependent on infrastructure provided by a small number of actors. This situation also has an impact on the open provision of models. When openness becomes a question of access to resources, scale becomes a problem for openness: truly open AI systems are not possible if the resources needed to build them from scratch and deploy them at scale remain closed, available only to those who command such resources – and these are almost always corporations. A recently published study on the concentration of power and the political economy of open AI therefore concludes that open source and centralisation are mutually exclusive: “only a few large tech corporations can create and deploy large AI systems at scale, from start to finish – a far cry from the decentralized and modifiable infrastructure that once animated the dream of the free/open source software movement”. A company name like “OpenAI” thus becomes an oxymoron.

Against this backdrop, it becomes clear that the European concept of data spaces represents a counter-movement to the monopolistic structures of tech companies. The openness, data sovereignty and trustworthiness that these data spaces represent will not open up the possibility of building infrastructures that can compete with those of the big tech companies. However, they will make it possible to develop specific models with clearly defined tasks that work more efficiently than the general-purpose applications developed by the tech companies. In this way, the value of efficiency, which is central to the field of machine learning, could be recoded.

Power-Hungry Magic

“Any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke observed, and it is part of the magic of new technologies that their downsides are systematically concealed. This is also the case with the energy consumption of large language models (LLMs): just as the schnitzel on the consumer’s plate obscures the realities of factory farming, so the marvels of artificial intelligence obscure what goes into them. Information about the computing power required to create products such as ChatGPT and about the big data used is not provided – whether to avoid making data protection and copyright issues too obvious, or to avoid having to quantify the energy consumption and CO2 emissions involved in training and operating these models. The reputable newspaper Die Zeit estimated in March 2023: “For the operation of ChatGPT, […] costs of 100,000 to 700,000 dollars a day are currently incurred” and noted “1,287 gigawatt hours of electricity” or “emissions of an estimated 502 tonnes of CO2” for the training of GPT-3 (Art. “Hidden energy”, in: Die Zeit No. 14, 30.03.2023, p. 52). Against this backdrop, it comes as no surprise that, according to the International Energy Agency, the electricity consumption of the big tech companies Amazon, Microsoft, Google and Meta doubled to 72 TWh between 2017 and 2021; these four companies are also the world’s largest providers of commercially available cloud computing capacity.

Recently, Sasha Luccioni, Yacine Jernite and Emma Strubell presented the first systematic study on the energy consumption and CO2 emissions of various machine learning models during the inference phase. Inference here means the operation of the models, i.e. the period of deployment after training and fine-tuning. Inference accounts for around 80 to 90 per cent of the costs of machine learning; on a cloud computing platform such as Amazon Web Services (AWS), the figure is around 90 per cent according to the operator. The study by Luccioni et al. emphasises the differences between various machine learning applications: the power and CO2 intensity is far lower for text-based applications than for image-based tasks, and likewise far lower for discriminative tasks than for generative ones, including generative pretrained transformers (GPTs). The differences between the various models are considerable: “For comparison, charging the average smartphone requires 0.012 kWh of energy which means that the most efficient text generation model uses as much energy as 16% of a full smartphone charge for 1,000 inferences, whereas the least efficient image generation model uses as much energy as 950 smartphone charges (11.49 kWh), or nearly 1 charge per image generation.” And the larger the model, the sooner the inference phase consumes as much electricity, and emits as much CO2, as the training phase did.
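
As a quick back-of-the-envelope check, the quoted comparison can be recombined; the per-image figures below are derived from the quoted numbers, not taken from the study directly.

```python
# Back-of-the-envelope check of the smartphone comparison quoted above.
SMARTPHONE_CHARGE_KWH = 0.012  # energy for one average smartphone charge (quoted)

# Most efficient text-generation model: ~16% of one charge per 1,000 inferences.
text_gen_kwh_per_1000 = 0.16 * SMARTPHONE_CHARGE_KWH
print(f"Text generation: {text_gen_kwh_per_1000:.4f} kWh per 1,000 inferences")
# -> roughly 0.0019 kWh, i.e. about two millionths of a kWh per single inference

# Least efficient image-generation model: 11.49 kWh per 1,000 inferences.
image_gen_kwh_per_1000 = 11.49
charges = image_gen_kwh_per_1000 / SMARTPHONE_CHARGE_KWH
print(f"Image generation: {charges:.0f} smartphone charges per 1,000 images")
# -> roughly 958 charges, i.e. close to one full charge per generated image
print(f"Per image: {image_gen_kwh_per_1000 / 1000:.5f} kWh "
      f"vs. one charge at {SMARTPHONE_CHARGE_KWH} kWh")
```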

Since ‘general purpose’ applications consume more energy for the same task than models trained for a specific purpose, Luccioni et al. point out several trade-offs. Firstly, a trade-off between model size and power consumption: the benefits of multi-purpose models must be weighed against their power costs and CO2 emissions. Secondly, a trade-off between accuracy and electricity consumption across different models: task-specific models achieve higher accuracy and efficiency at lower power consumption, whereas multi-purpose models can fulfil many different tasks, but with lower accuracy and higher electricity consumption. According to the authors, these empirical findings call into question whether it is really necessary to operate multi-purpose models such as Bard and Bing at all: they “do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.”

Unlike the famous Club of Rome report more than 50 years ago, the power hunger of large general-purpose models has not brought the “limits to growth” to the attention of the leading entrepreneurs and investors of Western big tech companies. On the contrary, figures such as Jeff Bezos, whose empire includes AWS, the largest cloud computing platform, fear stagnation: “We will have to stop growing, which I think is a very bad future.” Visions such as the Metaverse are extremely costly in terms of resource consumption and emissions, and it is fair to ask whether AI applications will really be available to all of humanity in the future or only to those companies or individuals who can afford them. None of this is even remotely sustainable. Given the growing power consumption of Western big tech companies and the fact that the core infrastructure for the development of AI products is already centralised in the hands of those few players, it remains unclear where the development of ‘magical’ AI applications will lead. The scholar Kate Crawford has given her own answer in her book “Atlas of AI”: into space, because that is where the resources are that these corporations need.

Feeding the Cuckoo

Large Language Models (LLMs) combine words that frequently appear in similar contexts in the training dataset; on this basis, they predict the most probable word or sentence. The larger the training dataset, the more possible combinations there are, and the more ‘creative’ the model appears. The sheer size of models such as GPT-4 already provides a competitive advantage that is hard to match: only a handful of companies in the world can combine exorbitant computing power, the availability of big data and an enormous market reach to create such a product. No research institutions are involved in the current competition, but the big tech companies Microsoft, Meta and Google are. Few players and few models, however, also mean a “race to the bottom” in terms of security and ethics, since using big data to train LLMs usually also means that the training data contains sensitive and confidential information as well as copyrighted material. In numerous court cases, the tech giants have been accused of collecting the data of millions of users online without their consent and of violating copyright law in order to train AI models.

Internet users have therefore already helped to feed the cuckoo. Google acknowledged this indirectly by updating its privacy policy in June 2023: “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Less well known, however, is the fact that the big tech companies also train models such as Bard on what users entrust to them. In other words, everything you tell a chatbot can in turn be used as training material. In Google’s own words: “Google uses this data to provide, improve, and develop Google products, services, and machine-learning technologies.” One consequence of the design of LLMs, however, is that the output of generative models cannot be controlled; with large models there are simply too many possibilities. If an LLM has been trained on private or confidential data, that data can resurface in its output and confidential information can be revealed. The training data should therefore already comply with data protection regulations, which is why there are repeated calls for transparency with regard to training data.

Consequently, in its Bard Privacy Help Hub, Google warns users of the model not to feed it with sensitive data: “Please don’t enter confidential information in your Bard conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.” This is interesting insofar as the AI hype is fuelled by terms such as ‘disruption’, while it remains unclear what business model is supposed to generate profits for big tech companies in the medium term – and what exactly the use case for average users is supposed to be. One use case, however, is the generation of texts that are needed on a daily basis, namely well-formulated application letters. Yet if you upload your own CV for this purpose, you are just feeding the cuckoo again. And that is not in our interest: after all, privacy is (also) a commons.

Human-Machine Cognition

Humans search for themselves in non-human creatures and inanimate artefacts. Apes, our “next of kin”, and dogs, our “most faithful companions”, are good examples of the former; robots are a good example of the latter. A human-like design of robots’ bodies and a humanising linguistic framing of their capabilities support, according to a common hypothesis, the anthropomorphisation of these machines and, as a consequence, the development of empathetic behaviour towards robots. The tendency to anthropomorphise varies from person to person; there are “stable individual differences in the tendency to attribute human-like attributes to nonhuman agents”.

Large Language Models (LLMs) are not (yet) associated with human-like body shapes. This does not mean, however, that they are exempt from the human tendency to anthropomorphise. Even a well-formulated sentence can lead us to wrongly assume that it was spoken by a rational agent. Large language models are now excellently capable of reproducing human language: they have absorbed its rules and patterns from vast amounts of text and have an excellent command of them. However, knowledge of the statistical regularities of language does not amount to “understanding”. The ability to use language appropriately in a social context also remains incompletely developed in LLMs; they lack the necessary world knowledge, sensory access to the world and commonsense reasoning. That we nevertheless tend to read the text produced by generative pretrained language models (GPTs) as human utterances is due, on the one hand, to the fact that these language models have been trained on very large volumes of 21st-century text and can therefore perfectly replicate our contemporary discourse. If the way in which meaning is produced through language corresponds to our everyday habits, it can come as no surprise that we attribute “intelligence”, “intentionality” or even “identity” to the producer of a well-crafted text. In this respect, LLMs confirm the structuralist theories of the second half of the 20th century, according to which language is a system that defines and limits the framework of what can be articulated and thus ultimately thought. And in this respect, LLMs also seem to confirm Roland Barthes’ thesis of the “death of the author”. The endless recombination of the available word material and the prediction of the most probable words and sentences seem to be enough for us to recognise ourselves in the text output.

On the other hand, the specific design of chatbots supports anthropomorphisation. ChatGPT, for example, has been fine-tuned on tens of thousands of question-answer pairs. Instruction fine-tuning ensures that the model generates text sequences in a specific format: the LLM interprets the prompt as an instruction, distinguishes the input of the interlocutor from the text it produces itself and draws conclusions about the human participants. On the one hand, this means that the language model can adapt the generated text to its human counterpart and imitate sociolects; on the other hand, it creates in humans the cognitive illusion of a dialogue. The interface of apps such as ChatGPT further supports this illusion; it is designed like all the other interfaces we use for human conversations. We then follow our habits and, in the dialogue with the chatbot, supply the social context that is characteristic of a conversation and assume intentionality on the other side. Finally, ChatGPT was trained as a fictional character that provides answers in the first person. The language model therefore produces statements about itself, for example about its ethical and moral behaviour, its performance, privacy and the training data used. If a user asks for inappropriate output, the language model politely declines. These statements are best understood as an echo of the training process – as what OpenAI would like us to believe about this technology. The dialogue form and the fictional character reporting in the first person are the only ways in which OpenAI can control the output of the language model.
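
To make the mechanism a little more concrete, here is a schematic sketch of how instruction fine-tuning data might be formatted; the role names, separator tokens and example content are illustrative assumptions, not OpenAI’s actual training format.

```python
# Schematic sketch of instruction fine-tuning data and chat formatting.
# Role names, separator tokens and content are illustrative assumptions.

chat = [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "Who wrote 'The Death of the Author'?"},
    {"role": "assistant", "content": "Roland Barthes, in 1967."},
]

def to_training_text(messages):
    """Flatten a dialogue into one text sequence the model is trained to continue."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>")
    return "\n".join(parts)

print(to_training_text(chat))
# The model only ever sees such flattened token sequences; the 'dialogue'
# with distinct speakers is a formatting convention layered on top of
# next-token prediction.
```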

All of this can be summarised as “anthropomorphism by design”. It is therefore no wonder that we humans tend to ascribe human characteristics to a disembodied language model. Yet while we are learning how to use such chatbots, we must not succumb to the illusion that we are dealing with a human interlocutor. Empathetic statements or emotions uttered by the bot are simulations, and they can become extremely problematic if we, for example, mistake the bot for a therapist. The assumption that a language model could be suited to making decisions and could therefore take on the role of lawyers, doctors or teachers is equally misleading: in the end, it is still humans who bear the responsibility for such decisions. We must not, therefore, be tricked by an anthropomorphising design. The impression that we have anything other than a machine as our counterpart is deceptive: there is no one there.

It’s the statistics, stupid

“It’s the statistics, stupid”, one could say when it comes to dealing with generative pretrained transformers (GPTs). Yet, one year after the presentation of ChatGPT, this is a lesson we all still have to learn. Statistical correlations are key to understanding how stochastic prediction models work and what they are capable of.

Put simply, machine learning consists of showing a machine data from which it learns, or memorises, what belongs to what. This data is called the training dataset. Once the machine has learnt the correlations, a test dataset is presented to the model, i.e. data it has not yet seen. The result can be used to measure how well the machine has learnt the correlations. The basic principle is that probability models are trained on as much representative initial data (i.e. examples) as possible, in order to then be applied to further, unseen data. The quality of such a model therefore always depends on how voluminous, varied and reliable the initial training data is.
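
A minimal sketch of this train/test principle, using scikit-learn and its bundled handwritten-digit dataset (an illustrative choice, not an example from the text):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Training data: what the model is shown. Test data: unseen examples
# used afterwards to measure how well the correlations were learnt.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print(f"Accuracy on unseen test data: {model.score(X_test, y_test):.2f}")
```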

Large language models (LLMs) are trained to write texts on specific topics, to provide answers to questions and to create the illusion of a dialogue. The machine is shown a large number of texts in which individual words are “masked” or hidden, for example: “It’s the [mask], stupid”. In response to the question: “What is this election about?”, the model then makes a prediction as to which word—based on the training data—would most likely be in the place of [mask], in this case “economy”. In principle, “deficit”, “money” or “statistics” could just as well be used here, but “economy” is by far the most common term in the training data and therefore the most likely word. The language model combines words that often appear in similar contexts in the training data set. The same applies to whole sentences or even longer texts.
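
To see such a prediction in action, one could query a BERT-style masked language model via the Hugging Face transformers library; the choice of model (“bert-base-uncased”) is an illustrative assumption, and the output depends entirely on that model’s training data.

```python
# Masked-word prediction with a BERT-style model (illustrative model choice).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("It's the [MASK], stupid."):
    # Each candidate word comes with the model's probability estimate.
    print(f"{prediction['token_str']:>12}  p = {prediction['score']:.3f}")

# Whether "economy" tops the list depends entirely on the training data;
# the model only ranks words by how probable they are in this context.
```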

However, the fact that LLMs predict probabilities has serious consequences. That a sentence predicted by a model is probable, for example, says nothing about whether it is true or false. The generated texts may also contain misinformation such as outdated or false statements or outright fictions. Language models such as ChatGPT do not learn patterns that could be used to evaluate the truth of a statement. It is therefore up to the people using the chatbot to check the credibility or truthfulness of a statement and to contextualise it. We should all learn how to do this, just as we learnt ‘back then’ to check the reliability of a source presented as the result of a Google search. For some areas of life, the distinction between true and false is central – in science, for example. A generative model that is able to produce scientific texts but cannot distinguish between true and false is therefore bound to fail, as was the case with the “Galactica” model presented by Meta, which was trained on 48 million scientific articles. Such a model inevitably also raises questions of good scientific practice. Since science is essentially a system of references, the fact that generative models such as ChatGPT fabricate references (i.e. generate a merely probable sequence of words) when in doubt is a real problem. It can therefore come as no surprise that ‘hallucinate’ was named Word of the Year 2023 by the Cambridge Dictionary.

Furthermore, the truthfulness of facts depends on the context. This may sound strange at first, but even the banal question “What is the capital of the Federal Republic of Germany?” shows that the answer can vary: just over 30 years ago, “Bonn on the Rhine” would have been the correct answer. And the answer to the question “What is this election about?” would probably be different today than it was 30 years ago (spoiler suggestion: oligarchy vs. democracy). With regard to science, it becomes even more complex: the progress of scientific knowledge means that statements that were considered true and factual just a few decades ago are now considered outdated. Programming code, too, requires humans to check what a generative model produces. This is why Stack Overflow, one of the most important platforms for software developers, still does not allow answers generated by such models: there is a realistic risk that they provide false or misleading information or malicious code. Large language models cannot verify the truth of a statement because, unlike humans, they have no world knowledge and therefore cannot compare their output with the relevant context.

Beyond science and software development, a serious risk of language models in general is the creation of misinformation. If such models are used to generate (factually incorrect) content that is disseminated via social media or fills the comment fields of news sites, this can have serious consequences—they can increase polarisation and mistrust within a society or undermine shared basic convictions. This can have significant political consequences: In 2024, for example, new governments will be elected in the USA and India, and we can assume that these election campaigns will be largely decided by the content provided on social media. Is it the stupid statistics?