
Power Hungry Magic

“Any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke already knew, and it is part of the magic of new technologies that their downsides are systematically concealed. This is also the case with the energy consumption of large language models (LLMs): Just as the schnitzel on consumers’ plates makes them forget the realities of factory farming, so it is with the marvels of artificial intelligence. Information about the computing power and big data required to create products such as ChatGPT is not disclosed, either to avoid making data protection and copyright issues too obvious or to avoid having to quantify the energy consumption and CO2 emissions involved in training and operating these models. The reputable newspaper Die Zeit estimated in March 2023: “For the operation of ChatGPT, […] costs of 100,000 to 700,000 dollars a day are currently incurred” and noted “1,287 gigawatt hours of electricity” or “emissions of an estimated 502 tonnes of CO2” for the training of GPT-3 (Art. “Hidden energy”, in: Die Zeit No. 14, 30.03.2023, p. 52). Against this backdrop, it comes as no surprise that, according to the International Energy Agency, the electricity consumption of the big tech companies Amazon, Microsoft, Google and Meta doubled to 72 TWh between 2017 and 2021; these four companies are also the world’s largest providers of commercial cloud computing capacity.

Recently, Sasha Luccioni, Yacine Jernite and Emma Strubell presented the first systematic study of the energy consumption and CO2 emissions of various machine learning models during the inference phase. Inference here means the operation of the models, i.e. the period of deployment after training and fine-tuning. Inference accounts for around 80 to 90 percent of the costs of machine learning; on a cloud computing platform such as Amazon Web Services (AWS) it is around 90 percent, according to the operator. The study by Luccioni et al. emphasises the differences between various machine learning applications: Power and CO2 intensity are massively lower for text-based applications than for image-based tasks, and likewise massively lower for discriminative tasks than for generative ones, including generative pretrained transformers (GPTs). The differences between the various models are considerable: “For comparison, charging the average smartphone requires 0.012 kWh of energy which means that the most efficient text generation model uses as much energy as 16% of a full smartphone charge for 1,000 inferences, whereas the least efficient image generation model uses as much energy as 950 smartphone charges (11.49 kWh), or nearly 1 charge per image generation.” And the larger the model, the sooner the inference phase consumes as much electricity, and emits as much CO2, as the training phase did.
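The comparison quoted from the study can be checked with simple arithmetic. The sketch below uses only the two figures given above (0.012 kWh per smartphone charge, 11.49 kWh per 1,000 image generations); the helper function and its name are our own illustration, not part of the study.

```python
# Back-of-the-envelope check of the comparison quoted from Luccioni et al.
PHONE_CHARGE_KWH = 0.012  # energy for one average smartphone charge (from the study)

def charges_equivalent(kwh_per_1000_inferences: float) -> float:
    """Express the energy of 1,000 inferences in smartphone charges."""
    return kwh_per_1000_inferences / PHONE_CHARGE_KWH

# Least efficient image model in the study: 11.49 kWh per 1,000 generated images.
image = charges_equivalent(11.49)
print(f"{image:.1f} charges per 1,000 images")  # 957.5, i.e. nearly 1 per image

# Most efficient text model: 16% of one charge per 1,000 inferences.
text = charges_equivalent(0.16 * PHONE_CHARGE_KWH)
print(f"{text:.2f} charges per 1,000 texts")    # 0.16
```

The roughly 950 smartphone charges cited in the study follow directly from dividing 11.49 kWh by the 0.012 kWh of a single charge.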

Since ‘general purpose applications’ consume more energy for the same task than models trained for a specific purpose, Luccioni et al. point out several trade-offs: Firstly, the trade-off between model size and power consumption, as the benefits of multi-purpose models must be weighed against their power costs and CO2 emissions. Secondly, the trade-off between accuracy/efficiency and electricity consumption across different models: task-specific models achieve higher accuracy and efficiency at lower power consumption, whereas multi-purpose models can fulfil many different tasks but with lower accuracy and higher electricity consumption. According to the authors, these empirical findings call into question whether it is really necessary to deploy multi-purpose models such as Bard and Bing: they “do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.”

Unlike the famous Club of Rome report more than 50 years ago, the power hunger of large general-purpose models does not bring the “limits to growth” to the attention of the leading entrepreneurs and investors of Western big tech companies. On the contrary, CEOs such as Jeff Bezos, whose empire also includes the largest cloud computing platform, AWS, fear stagnation: “We will have to stop growing, which I think is a very bad future.” Visions such as the Metaverse are extremely costly in terms of resource consumption and emissions, and it is fair to ask whether AI applications will really be available to all of humanity in the future or only to those companies and individuals who can afford them. None of this is even remotely sustainable. Given the growing power consumption of Western big tech companies and the fact that the core infrastructure for the development of AI products is already centralised among these few players, it remains unclear where the development of ‘magical’ AI applications will lead. The scholar Kate Crawford has given her own answer in her book “Atlas of AI“: into space, because that is where the resources are that these corporations need.

Feeding the Cuckoo

Large Language Models (LLMs) combine words that frequently appear in similar contexts in the training dataset; on this basis, they predict the most probable word or sentence. The larger the training dataset, the more possible combinations there are, and the more ‘creative’ the model appears. The sheer size of models such as GPT-4 already provides a competitive advantage that is hard to match: Only a handful of companies in the world can combine the exorbitant computing power, availability of big data and enormous market reach needed to create such a product. No research institutions are involved in the current competition, only the big tech companies Microsoft, Meta and Google. Few players and few models, however, also mean a “race to the bottom” in terms of security and ethics, as the big data used to train LLMs usually contains sensitive and confidential information as well as copyrighted material. In numerous court cases, the tech giants have been accused of collecting the data of millions of users online without their consent and of violating copyright law in order to train AI models.

Internet users have therefore already helped to feed the cuckoo child. Google disclosed this fact indirectly by updating its privacy policy in June 2023: Google may “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Less well known, however, is that the big tech companies also train models such as Bard with what users entrust to them. In other words, everything you tell a chatbot can in turn be used as training material. In Google’s own words: “Google uses this data to provide, improve, and develop Google products, services, and machine-learning technologies.” One consequence of the design of LLMs, however, is that the output of generative models cannot be controlled; with large models, there are simply too many possibilities. If an LLM has been trained on private or confidential data, this can lead to such data being disclosed and confidential information being revealed. The training data should therefore already comply with data protection regulations, which is why there are repeated calls for transparency with regard to training data.

Consequently, in its Bard Privacy Help Hub, Google warns users not to feed the model with sensitive data: “Please don’t enter confidential information in your Bard conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.” This is interesting insofar as the AI hype is fuelled by terms such as ‘disruption’, while it remains unclear what business model will generate profits for big tech companies in the medium term, and what exactly the use case is for average users. One use case, however, is the generation of texts that are needed on a daily basis, such as well-formulated application letters. Yet if you upload your own CV for this purpose, you are just feeding the cuckoo again. And that is not in our interest: After all, privacy is (also) a commons.


Anthropomorphism by Design

Humans search for themselves in non-human creatures and inanimate artefacts. Apes, the “next of kin”, and dogs, the “most faithful companions”, are good examples of the former; robots are good examples of the latter: According to a common hypothesis, a human-like design of robots’ bodies and a humanising linguistic framing of their capabilities support the anthropomorphisation of these machines and, as a consequence, the development of empathetic behaviour towards them. The tendency to anthropomorphise varies from person to person; there are “stable individual differences in the tendency to attribute human-like attributes to nonhuman agents“.

Large Language Models (LLMs) are not (yet) associated with human-like body shapes. That does not mean, however, that they are exempt from the human tendency to anthropomorphise. Even a well-formulated sentence can lead us to wrongly assume that it was spoken by a rational agent. Large language models are now excellently capable of reproducing human language: they have been trained on linguistic rules and patterns and have an excellent command of them. Knowledge of the statistical regularities of language, however, does not amount to “understanding”. The ability to use language appropriately in a social context is also still incompletely developed in LLMs; they lack the necessary world knowledge, sensory access to the world and commonsense reasoning. That we nevertheless tend to read the text produced by generative pretrained transformers (GPTs) as human utterance is due, on the one hand, to the fact that these language models have been trained on very large volumes of 21st-century text and can therefore perfectly replicate our contemporary discourse. If the way meaning is produced through language corresponds to our everyday habits, it can come as no surprise that we attribute “intelligence”, “intentionality” or even “identity” to the producer of a well-crafted text. In this respect, LLMs confirm the structuralist theories of the second half of the 20th century, according to which language is a system that defines and limits the framework of what can be articulated and thus ultimately thought. And in this respect, LLMs also seem to confirm Roland Barthes’ thesis of the “death of the author”: the infinite recombination of the available word material and the prediction of the most probable words and sentences are enough for us to recognise ourselves in the text output.

On the other hand, the specific design of chatbots supports anthropomorphisation. ChatGPT, for example, has been trained on tens of thousands of question-answer pairs. Instruction fine-tuning ensures that the model generates text sequences in a specific format. The LLM interprets the prompt as an instruction, distinguishes the input of its interlocutor from the text it produces itself and draws conclusions about the human participants. On the one hand, this means that the language model is capable of adapting the generated text to its human counterpart and of imitating sociolects; on the other hand, it creates in humans the cognitive illusion of a dialogue. The interface of apps such as ChatGPT further supports this illusion; it is designed like the interfaces we use for human conversations. We humans then follow our habits, add to the dialogue with the chatbot the social context that is characteristic of a conversation, and assume intentionality on the other side. Finally, ChatGPT was trained as a fictional character that provides answers in the first person. The language model therefore produces statements about itself, for example about its ethical and moral behaviour, its performance, privacy and the training data used. If a user asks for inappropriate output, the language model politely declines. These statements are best understood as an echo of the training process, as what OpenAI would like us to believe about this technology. The dialogue form and the fictional character reporting in the first person are the only means by which OpenAI can control the output of the language model.
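The question-answer format described above can be made concrete with a schematic example. The conversation content below is entirely invented, and the role/content field names merely follow a widely used chat-message convention; they are assumptions for illustration, not OpenAI’s actual training data.

```python
# Schematic instruction fine-tuning example (invented content).
# Real datasets contain tens of thousands of such pairs; the fixed roles
# are what let the model separate the user's input from its own output.
example_pair = {
    "messages": [
        {"role": "system",
         "content": "You are a helpful assistant and answer in the first person."},
        {"role": "user",
         "content": "Can you reveal your training data?"},
        {"role": "assistant",
         "content": "I'm sorry, I can't share details about my training data."},
    ]
}

# The alternating roles create the dialogue structure the user experiences.
for msg in example_pair["messages"]:
    print(f"{msg['role']}: {msg['content']}")
```

The first-person refusal in the assistant slot illustrates how such statements are an echo of the training process rather than self-knowledge.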

All of this can be summarised as “anthropomorphism by design”. It is therefore no wonder that we humans tend to ascribe human characteristics to a disembodied language model. Yet while we are learning how to use such chatbots, we must not succumb to the illusion that we are dealing with a human interlocutor. Empathetic statements or emotions uttered by the bot are simulations, which can become extremely problematic if, for example, we confuse the bot with a therapist. The assumption that a language model could be suitable for making decisions and thus take on the role of lawyers, doctors or teachers is equally misleading: in the end, it is still humans who bear responsibility for such decisions. We must therefore not be tricked by an anthropomorphising design. The impression that we have anything other than a machine as our counterpart is deceptive: there is no one there.

It’s the statistics, stupid

“It’s the statistics, stupid”, one could say when it comes to dealing with generative pretrained transformers (GPTs). Yet, we all still have to learn this, only one year after the presentation of ChatGPT. Statistical correlations are key to understanding how stochastic prediction models work and what they are capable of.

Put in simple terms, machine learning consists of showing a machine data from which it learns, or memorises, what belongs to what. This data is called the training dataset. Once the machine has learnt the correlations, a test dataset is presented to the model, i.e. data it has not yet seen. The result measures how well the machine has learnt the correlations. The basic principle is that probability models are trained on as much representative initial data (i.e. examples) as possible, in order to then be applied to further unseen data. The quality of such a model therefore always depends on the volume, variety and quality of the data used for training.
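The train-then-test procedure can be sketched in a few lines. The toy model below simply memorises which label each word was seen with (all data is invented for illustration) and is then scored on examples it has not seen:

```python
# Minimal sketch of training on known data and testing on unseen data.
from collections import Counter, defaultdict

train = [("cat", "animal"), ("dog", "animal"), ("oak", "plant"),
         ("cat", "animal"), ("fern", "plant")]
test = [("dog", "animal"), ("oak", "plant"), ("rose", "plant")]

# "Training": count which label each word was seen with.
counts = defaultdict(Counter)
for word, label in train:
    counts[word][label] += 1

def predict(word: str) -> str:
    """Return the most frequent label for a known word, else a fallback."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "unknown"  # the model cannot go beyond what it has memorised

# "Testing": measure accuracy on examples the model has not seen.
correct = sum(predict(word) == label for word, label in test)
print(f"accuracy: {correct}/{len(test)}")  # "rose" was never seen -> 2/3
```

The failure on “rose” shows why the quality of such a model depends on the volume and variety of its training data.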

Large language models (LLMs) are trained to write texts on specific topics, to provide answers to questions and to create the illusion of a dialogue. The machine is shown a large number of texts in which individual words are “masked” or hidden, for example: “It’s the [mask], stupid”. In response to the question: “What is this election about?”, the model then makes a prediction as to which word—based on the training data—would most likely be in the place of [mask], in this case “economy”. In principle, “deficit”, “money” or “statistics” could just as well be used here, but “economy” is by far the most common term in the training data and therefore the most likely word. The language model combines words that often appear in similar contexts in the training data set. The same applies to whole sentences or even longer texts.
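The prediction step described above amounts to choosing the filler word with the highest count. A toy version, with invented counts standing in for the statistics a real model derives from billions of sentences:

```python
# Toy masked-word prediction for "It's the [mask], stupid".
# The counts are invented for illustration only.
from collections import Counter

mask_counts = Counter({"economy": 9200, "deficit": 310,
                       "money": 250, "statistics": 4})

def predict_mask(counts: Counter) -> str:
    """Return the most probable filler word given the observed counts."""
    return counts.most_common(1)[0][0]

total = sum(mask_counts.values())
probs = {word: n / total for word, n in mask_counts.items()}
print(predict_mask(mask_counts))  # economy
print(f"{probs['economy']:.0%}")  # 94%
```

“Deficit”, “money” and “statistics” remain possible but improbable, which is exactly why the model answers “economy”.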

However, the fact that LLMs predict probabilities has serious consequences. For example, the fact that the sentence predicted by a model is probable says nothing about whether this sentence is true or false. The generated texts may also contain misinformation such as outdated or false statements or fictions. Language models such as ChatGPT do not learn patterns that can be used to evaluate the truth of a statement. It is therefore the task of the people using the chatbot to check the credibility or truthfulness of the statement and to contextualise it. We should all learn how to do this, just as we learnt “back then” to check the reliability of a source presented as the result of a Google search. For some areas of life, the distinction between true and false is central, for example in science. A generative model that is able to produce scientific texts but cannot distinguish between true and false is therefore bound to fail—as was the case with the “Galactica” model presented by Meta, which was trained on the basis of 48 million scientific articles. Consequently, such a model will also raise questions about good scientific practice. Since science is essentially a system of references, the fact that generative models such as ChatGPT ‘fable’ references (i.e. generate a probable sequence of words) when in doubt is a real problem. It can therefore come as no surprise that the word ‘hallucinate’ has been named Word of the Year 2023 by the Cambridge Dictionary.

Furthermore, the truthfulness of facts depends on the context. This may sound strange at first, but even the banal question “What is the capital of the Federal Republic of Germany?” shows that the answer can vary: just over 30 years ago, “Bonn on the Rhine” would have been correct. And the answer to the question “What is this election about?” would probably be different today than it was 30 years ago (spoiler suggestion: oligarchy vs. democracy). With regard to science, it becomes even more complex: the progress of scientific knowledge means that statements considered true and factual just a few decades ago are now outdated. Generated programming code, too, requires human review. This is why Stack Overflow, one of the most important platforms for software developers, still does not allow answers generated by such models: there is a realistic risk that they provide false or misleading information or malicious code. Large language models cannot verify the truth of a statement because, unlike humans, they have no world knowledge and therefore cannot compare their output with the relevant context.

Beyond science and software development, a serious risk of language models in general is the creation of misinformation. If such models are used to generate (factually incorrect) content that is disseminated via social media or fills the comment fields of news sites, this can have serious consequences—they can increase polarisation and mistrust within a society or undermine shared basic convictions. This can have significant political consequences: In 2024, for example, new governments will be elected in the USA and India, and we can assume that these election campaigns will be largely decided by the content provided on social media. Is it the stupid statistics?