
On Objectivity – and the Bridge to Truth

Statistics are held in high regard. Although the saying goes, “do not trust any statistics you did not fake yourself”, they are often regarded as a prime example of objectivity founded on large datasets. This view is taken to the extreme when it comes to machine learning, since machine learning models are statistical learners. A recently published research article criticizes this view: “the mythology surrounding ML presents it—and justifies its usage in said contexts over the status quo of human decision-making—as paradigmatically objective in the sense of being free from the influence of human values” (Andrews et al. 2024).

The fact that machine learning is seen as an extreme case of objectivity has its origins in the 19th century. Back then, the foundations of our current understanding of objectivity were laid. Human (and fallible) subjectivity was contrasted with mechanical objectivity. At that time, machines were considered to be free from willful intervention, which was seen as the most dangerous aspect of subjectivity (Daston / Galison 2007). To this day, machines – be they cameras, sensors or electronic devices, or even the data they produce – have remained emblematic of the elimination of human agency and the embodiment of objectivity without subjectivity. These perceptions persist, and it becomes necessary to explain why common sense continues to attribute objectivity and impartiality to data, statistics and machine learning.

The nineteenth-century debate finds its echo today in the discussion about biases. The fact that every dataset contains statistical distortions is obviously not compatible with the attribution of objectivity, which is supposed to be inherent in large datasets in particular. From a statistical point of view, large sample sizes make even tiny differences statistically significant, so the effect size becomes the more informative measure. On the other hand, “large” does not mean “all”; rather, one must be aware of the universe covered by the data. Statistical inference, i.e. conclusions drawn from data about the population as a whole, cannot be easily applied because the datasets are not established to ensure representativeness (Kitchin 2019). A recent article states with regard to biases: “Data bias has been defined as ‘a systematic distortion in the data’ that can be measured by ‘contrasting a working data sample with reference samples drawn from different sources or contexts.’ This definition encodes an important premise: that there is an absolute truth value in data and that bias is just a ‘distortion’ from that value. This key premise broadly motivates approaches to ‘debias’ data and ML systems” (Miceli et al. 2022). What sounds like objectivity and ‘absolute truth value’ because it is based on large datasets, statistics and machine learning models is not necessarily correct: if the model is a poor representation of reality, the conclusions drawn from its results may be wrong. This is also the reason why Cathy O’Neil in 2016 described an algorithm as “an opinion formalized in code” – it does not simply offer objectivity, but works towards the purposes and goals for which it was written.
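
The point about sample size and effect size can be illustrated with a toy calculation – a minimal sketch using synthetic data, not figures from any of the studies cited: with a sufficiently large sample, a practically negligible difference still yields a “significant” p-value, which is why the effect size is the more telling number.

```python
# Toy illustration: with very large samples, even a negligible difference becomes
# "statistically significant", so effect size matters more than the p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=15.0, size=1_000_000)
group_b = rng.normal(loc=100.1, scale=15.0, size=1_000_000)  # difference of 0.1 - practically irrelevant

t_stat, p_value = stats.ttest_ind(group_a, group_b)
cohens_d = (group_b.mean() - group_a.mean()) / np.sqrt((group_a.var() + group_b.var()) / 2)

print(f"p-value: {p_value:.2e}")      # typically far below 0.05, simply because n is huge
print(f"Cohen's d: {cohens_d:.4f}")   # roughly 0.007 - a vanishingly small effect
```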

Relief fragment with depiction of rowers, Hatshepsut (Queen, Ancient Egypt, 18th Dynasty). Staatliche Museen zu Berlin, Egyptian Museum and Papyrus Collection. Public Domain Mark 1.0
A historical visualisation of scientists communicating with each other and harmonising their views, in the sense of the community standing above the individual?

The fact that scientists – and the machine learning community in particular – still adhere to the concept of objectivity and the objective nature of scientific knowledge is owed to the circumstance that the latter is socially constructed: it is partly derived from collective beliefs held by scientific communities (Fleck 1935/1980). Beyond the activity of the individual researcher, the embedding of research results within a broader scientific discourse shows that scientific research is a collective activity. Much of what is termed ‘science’ is based on social practices and procedures of adjudication. As the historian of science Naomi Oreskes noted in 2019, the heterogeneity of the scientific community paradoxically supports the strength of the achieved consensus: “Objectivity is likely to be maximized when […] the community is sufficiently diverse that a broad range of views can be developed, heard, and appropriately considered.” This was obviously also clear to Miceli et al. when they took a position in the debate on biases: “data never represents an absolute truth. Data, just like truth, is the product of subjective and asymmetrical social relations.” Ultimately, the processes that take place within such scientific communities lead to what is referred to as scientific truth. Data, statistics, machine learning and objectivity are embedded in social discourses, and in the last instance, it is the latter that form the bridge to truth.

Openness, and Some of its Shades

Openness, that lighthouse of the 20th century, came along with open interfaces (APIs). In the case of galleries, libraries, archives, and museums (GLAMs), it was the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH for short. At the time, the idea was to provide an interface that makes metadata available in interoperable formats and thus enables exchange between different institutions. In addition, it allows the harvesting of distributed resources described in XML, which may be restricted to named sets defined by the provider. The objects are referenced via URLs in the metadata, which also facilitates access to the objects themselves. Basically, the protocol is not designed to differentiate between users; licences and rights statements can be included, but it does not provide for masking specific material from access: the decision whether (and which) use is made of material protected by intellectual property rights ultimately lies with the users.
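
To make the mechanics of such harvesting concrete, here is a minimal Python sketch of an OAI-PMH client; the endpoint URL is a placeholder, and real providers differ in the metadata formats and sets they expose.

```python
# Minimal OAI-PMH harvesting sketch (the endpoint URL is a placeholder, not a real service).
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"   # OAI-PMH XML namespace
DC = "{http://purl.org/dc/elements/1.1/}"        # Dublin Core namespace
BASE_URL = "https://example.org/oai"             # hypothetical provider endpoint

def harvest(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Yield (identifier, titles) pairs, following resumption tokens until the list ends."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec                 # optional named set defined by the provider
    while True:
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            identifier = header.findtext(f"{OAI}identifier")
            titles = [t.text for t in record.iter(f"{DC}title")]
            yield identifier, titles
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break                                # no more pages
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Example usage (only works against a real OAI-PMH endpoint):
for identifier, titles in harvest(BASE_URL):
    print(identifier, titles)
```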

Lighthouse on the Breton coast, painting by Théodore Gudin, 1845. Staatliche Museen zu Berlin, Nationalgalerie. Public Domain Mark 1.0

The 21st century brought a new concept: data sovereignty. On the one hand, this implies that data are subject to the laws and governance structures that apply in the jurisdiction where the data are hosted; for the hosts, on the other hand, the concept stands for the notion that rights holders can determine themselves what third parties may and can do with the data. Given that there is now a second lighthouse – the provision of cultural heritage datasets for innovation and research – providing orientation in troubled times, the role of cultural heritage institutions as access brokers becomes tangible: if rights holders do not wish to provide their (IPR-protected) data openly to commercial AI companies, GLAM institutions as data providers are in the position to negotiate differentiations in the use of these data. For example, the data may be used freely by startups, small and medium-sized enterprises (SMEs) and companies active in the cultural sector, while for big tech this could involve fees. Interestingly, the European Data Governance Act provides for such a case and includes a relevant set of instruments. There is a chapter on the use of data provided by public sector bodies (Chapter II, Article 6), which regulates the provision of data in exchange for fees and allows the fees charged to be differentiated between private users, SMEs and startups on the one hand, and larger corporations, which do not fall under the former categories, on the other. In this way, a possibility for differentiation among commercial users is created, whereby the fees have to be based on the costs of the infrastructure for providing the data. For these cases, cultural heritage institutions need new licences (or rights statements), clarifying whether or not commercial enterprises are excluded from access to data on the basis of the rights holders’ opt-out, and whether or not big tech corporations get access by paying fees while data are provided free of charge to startups and SMEs.

While this describes the legal side of the role of GLAM institutions as access brokers, there is also a technical side to data sovereignty, addressed by “data spaces”. APIs like OAI-PMH will continue to ensure the exchange between institutions, but will lose importance in terms of data provision for third parties (apart from the provision of material which is in the public domain). By contrast, the concept of data spaces, which is of central importance for the European Commission’s policy for the coming years, will gain in importance. One planned data space is, for example, the European Data Space for Cultural Heritage, which is to be created in collaboration with Europeana; existing similar initiatives include the European Open Science Cloud (EOSC) and the European Collaborative Cloud for Cultural Heritage (ECCCH). A technical implementation of such a data space is GAIA-X, a European initiative for an independent cloud infrastructure. Amongst other functionalities, it enables GLAM institutions to keep their data on premises while delivering processed data to users of the infrastructure after an algorithm of the users’ choice has been applied to the data held by the cultural heritage institution: instead of downloading terabytes of data and processing them on their own, users can select an algorithm (or machine learning model) and send it to the data. An example providing such functionalities has been developed by the Berlin State Library with the CrossAsia Demonstrator. Such an infrastructure not only enables the handling of data with various rights of use, but also allows a differentiation between users as well as payment services. In other words: it grants full sovereignty over the data. As with all technical solutions, there is a downside: such data spaces are usually complex and difficult to manage, which poses an obstacle for cultural heritage institutions and often results in the need for additional staff.

Linked (but not bound) to the concepts of data spaces and data sovereignty is the idea of a commons. “Commons” designates a shared resource that is managed by a community for the benefit of its members. Europeana, the meta-aggregator and web portal for the digital collection of European cultural heritage, explicitly conceptualises the planned European Data Space for Cultural Heritage as “an open and resilient commons for the users of European cultural data, where data owners – as opposed to platforms – have control of their data and of how, when and with whom it is shared“. The formulation chosen here is indicative of a learning process with regard to openness: defining an open commons “as opposed to platforms” addresses an issue which is characteristic of open commons, namely the over-use of the available resources, which may lead to their depletion. In the classical examples of commons like fishing grounds or pasture, the resource is endangered if users try to profit from it without at the same time contributing to its preservation. However, this is not the case with digital resources. Rather, the issue lies with the potential loss of communal benefits due to actions motivated by self-interest. In the 21st century, the rise of the big platforms has revealed what has been termed “the paradox of open”: “open resources are most likely to contribute to the power of those with the best means to make use of them“. The need for data spaces managed by a community for the benefit of its members not only adds another shade to openness; at the same time, it opens up another front – the turn against platformisation implies a rejection of the dominance of non-European big tech companies.

Power Hungry Magic

“Any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke observed, and it is part of the magic of new technologies that their downsides are systematically concealed. This is also the case with the energy consumption of large language models (LLMs): as with the schnitzel that ends up on consumers’ plates and lets them forget the realities of factory farming behind it, so it is with the marvels of artificial intelligence. Information about the computing power required to create products such as ChatGPT and the big data used is not provided, either to avoid making data protection and copyright issues too obvious or to avoid having to quantify the energy consumption and CO2 emissions involved in training and operating these models. The reputable newspaper Die Zeit estimated in March 2023: “For the operation of ChatGPT, […] costs of 100,000 to 700,000 dollars a day are currently incurred” and noted “1,287 gigawatt hours of electricity” or “emissions of an estimated 502 tonnes of CO2” for the training of GPT-3 (Art. “Hidden energy”, in: Die Zeit No. 14, 30.03.2023, p. 52). Against this backdrop, it comes as no surprise that, according to the International Energy Agency, the electricity consumption of the big tech companies Amazon, Microsoft, Google and Meta doubled to 72 TWh between 2017 and 2021; these four companies are also the world’s largest providers of commercially available cloud computing capacity.

Recently, Sasha Luccioni, Yacine Jernite and Emma Strubell presented the first systematic study on the energy consumption and CO2 emissions of various machine learning models during the inference phase. Inference here means the operation of the models, i.e. the period of deployment after training and fine-tuning. Inference accounts for around 80 to 90 per cent of the costs of machine learning; on a cloud computing platform such as Amazon Web Services (AWS), it is around 90 per cent according to the operator. The study by Luccioni et al. emphasises the differences between various machine learning applications: the power and CO2 intensity is massively lower for text-based applications than for image-based tasks; similarly, it is massively lower for discriminative tasks than for generative ones, including generative pretrained transformers (GPTs). The differences between the various models are considerable: “For comparison, charging the average smartphone requires 0.012 kWh of energy which means that the most efficient text generation model uses as much energy as 16% of a full smartphone charge for 1,000 inferences, whereas the least efficient image generation model uses as much energy as 950 smartphone charges (11.49 kWh), or nearly 1 charge per image generation.” And the larger the model, the sooner the inference phase consumes as much electricity, and emits as much CO2, as the training phase did.
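
A quick back-of-the-envelope calculation, using only the figures quoted from the study above, makes these relations easier to grasp:

```python
# Back-of-the-envelope check of the figures quoted from Luccioni et al. (values as cited above).
smartphone_charge_kwh = 0.012                              # energy for one average smartphone charge
text_model_kwh_per_1000 = 0.16 * smartphone_charge_kwh     # "16% of a full charge" per 1,000 inferences
image_model_kwh_per_1000 = 11.49                           # least efficient image model, per 1,000 images

print(text_model_kwh_per_1000)                             # ~0.0019 kWh per 1,000 text generations
print(image_model_kwh_per_1000 / smartphone_charge_kwh)    # ~957 smartphone charges, i.e. roughly 950
print(image_model_kwh_per_1000 / 1000 / smartphone_charge_kwh)  # ~0.96, i.e. nearly 1 charge per image
```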

Since ‘general purpose’ applications consume more energy for the same task than models that have been trained for a specific purpose, Luccioni et al. point out several trade-offs. Firstly, the trade-off between model size and power consumption, as the benefits of multi-purpose models must be weighed against their power costs and CO2 emissions. Secondly, the trade-off between accuracy or efficiency and electricity consumption: task-specific models achieve higher accuracy and efficiency at lower power consumption, whereas multi-purpose models can fulfil many different tasks, but with lower accuracy and higher electricity consumption. According to the authors, these empirically proven findings call into question, for example, whether it is really necessary to operate multi-purpose models such as Bard and Bing: they “do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.”

Unlike the famous Club of Rome report more than 50 years ago, the power hunger of large general-purpose models has not brought the “limits to growth” to the attention of the leading entrepreneurs and investors of Western big tech companies. On the contrary, CEOs such as Jeff Bezos, whose empire also includes the largest cloud computing platform, AWS, fear stagnation: “We will have to stop growing, which I think is a very bad future.” Visions such as the Metaverse are extremely costly in terms of resource consumption and emissions, and it is fair to ask whether AI applications will really be available to all of humanity in the future or only to those companies or individuals who can afford them. Nothing of all of this is even remotely sustainable. Given the growing power consumption of Western big tech companies and the fact that the core infrastructure for the development of AI products is already centralised in the hands of those few players, it remains unclear where the development of ‘magical’ AI applications will lead. The scholar Kate Crawford has given her own answer to this in her book “Atlas of AI“: into space, because that is where the resources are that these corporations need.

Feeding the Cuckoo

Large language models (LLMs) combine words that frequently appear in similar contexts in the training dataset; on this basis, they predict the most probable next word or sentence. The larger the training dataset, the more possible combinations there are, and the more ‘creative’ the model appears. The sheer size of models such as GPT-4 already provides a competitive advantage that is hard to match: there are only a handful of companies in the world that can combine exorbitant computing power, the availability of big data and an enormous market reach to create such a product. No research institutions are involved in the current competition, but the big tech companies Microsoft, Meta and Google are. However, few players and few models also mean a “race to the bottom” in terms of security and ethics, as the use of big data for LLMs usually also means that the training data contains sensitive and confidential information as well as copyrighted material. In numerous court cases, the tech giants have been accused of collecting the data of millions of users online without their consent and violating copyright law in order to train AI models.
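
The underlying principle can be illustrated with a toy bigram model – a deliberately minimal sketch that is orders of magnitude simpler than an LLM, but shows how next-word prediction from co-occurrence counts works:

```python
# Toy illustration of next-word prediction from co-occurrence counts (a bigram model,
# vastly simpler than an LLM, but it demonstrates the statistical principle described above).
from collections import Counter, defaultdict

corpus = "the library provides data the library provides access the archive provides data".split()

# Count which word follows which in the training data.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict(word):
    """Return candidate next words with their probabilities, most probable first."""
    candidates = following[word]
    total = sum(candidates.values())
    return [(w, round(count / total, 2)) for w, count in candidates.most_common()]

print(predict("provides"))   # [('data', 0.67), ('access', 0.33)]
```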

Internet users have therefore already helped to feed the cuckoo chick. Google disclosed this fact indirectly by updating its privacy policy in June 2023: “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Less well known, however, is the fact that the big tech companies also train their models, such as Bard, with what users entrust to them. In other words, everything you tell a chatbot can in turn be used as training material. In Google’s own words, it sounds like this: “Google uses this data to provide, improve, and develop Google products, services, and machine-learning technologies.” One consequence of the design of LLMs, however, is that the output of generative models cannot be fully controlled; there are simply too many possibilities with large models. If the LLM was and is trained on private or confidential data, this can lead to these data being disclosed and confidential information being revealed. The training data should therefore already comply with data protection regulations, which is why there are repeated calls for transparency with regard to training data.

Consequently, in its Bard Privacy Help Hub, Google warns users of the model not to feed it with sensitive data: “Please don’t enter confidential information in your Bard conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.” This is interesting insofar as the AI hype is fuelled by terms such as ‘disruption’, while at the same time it remains unclear what the business model through which big tech companies intend to generate profits in the medium term will look like – and what exactly the use case for average users is supposed to be. One use case, however, is the generation of texts that are needed on a daily basis, namely well-formulated application letters. Yet if you upload your own CV for this purpose, you are just feeding the cuckoo again. And that is not in our interest: after all, privacy is (also) a commons.

On the Use of Licences in Times of Large Language Models

It could all be so simple: cultural heritage institutions and other public sector bodies provide high-quality data on a large scale and, wherever possible, under a permissive licence such as CC0 or the Public Domain Mark 1.0. This is in line with the idea that cultural heritage institutions are funded by taxes and that everyone should therefore benefit from their services and products; in the case of data, innovation, research and of course private use should be possible.

However, we live in times of large language models and exploitative practices, especially on the part of US-American big tech companies. Data are extracted from the web on a large scale and processed into proprietary large language models. These companies are not only the drivers of innovation; they also set themselves apart from research institutions, for example, by having specifically prepared training datasets at their disposal as well as exceptional computing power and the best-paid positions for developers of algorithms – all of these are expensive ingredients of a recipe for success in the face of limited competition.

One of the weaknesses of ChatGPT – and presumably of GPT-4 – is its lack of reliability. This weakness results from the inability of purely stochastic language models to distinguish between fact and fiction, but also from a lack of data. Especially with regard to “hallucinated” literature references, bibliographic data from libraries are very attractive for building large language models. Another problem is the lack of high-quality text data. According to a recently published study, high-quality text data will be exhausted before the year 2026; this is mainly due to the lack of etiquette and proper spelling on the internet. But who, if not the libraries, holds huge stocks of high-quality text data? Almost all the content available here has passed through a quality filter called “publishing houses”. Opinions may be divided about the intellectual quality of the books; but linguistically and orthographically, everything that was printed until the end of the 20th century (i.e. before the advent of self-publishing) is of very good quality.

Finally, dear money: inflation is back, the low-interest phase is over, and the first Silicon Valley bank has gone bankrupt. Many companies based there will soon need fresh money, and monetisation to generate profits will follow. From products that were previously offered free of charge (such as ChatGPT), new and more capable models will be created that provide demand-driven services in exchange for payment.

Should cultural heritage institutions as public entities serve the maximisation of the profits of a few companies by providing expensive and resource-intensive (and tax-funded) data for free? The answer has to be differentiated and is therefore complicated. Of course, data should also be made available under permissive licences, as has been the case up to now. A dual strategy can certainly be used here. On the one hand, data made available via interfaces such as OAI-PMH or IIIF continue to be accessible under the CC0 licence or the Public Domain Mark 1.0; technical access restrictions can prevent large-scale data extraction, e.g. by monitoring IP addresses or capping download volumes. On the other hand, specific data publications can be provided that bundle individual datasets to enable research and innovation; such offerings are protected as databases for 15 years, and here licences can be used that contain an “NC” (non-commercial) element and make such data usable for research and innovation. As an example, the Prussian Cultural Heritage Foundation uses such a licence (CC-BY-NC-SA) for the digital representation of one of its masterpieces, and the (not so easy to use) 3D scan is also freely available under this licence (download here).
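
What such a technical restriction might look like is sketched below – a deliberately simplified, hypothetical per-IP download cap in Python; real installations would more likely rely on web-server or API-gateway configuration, so the limit and function names here are purely illustrative.

```python
# Minimal sketch of a per-IP download cap (illustrative values, not a production solution).
import time
from collections import defaultdict

DAILY_LIMIT = 500                      # hypothetical maximum number of records per IP per day
WINDOW_SECONDS = 24 * 60 * 60          # length of the rolling window: one day

requests_per_ip = defaultdict(list)    # IP address -> timestamps of served requests

def allow_download(ip_address):
    """Return True if the IP address is still below the daily download maximum."""
    now = time.time()
    recent = [t for t in requests_per_ip[ip_address] if now - t < WINDOW_SECONDS]
    requests_per_ip[ip_address] = recent
    if len(recent) >= DAILY_LIMIT:
        return False                   # looks like large-scale extraction: refuse or throttle
    recent.append(now)
    return True
```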

Interestingly, the European Union anticipated the case described above in the Data Governance Act and included a relevant set of instruments. There is a chapter on the use of data provided by public sector bodies (Chapter II, Article 6), which regulates the provision of data in exchange for fees. It states that public sector bodies may differentiate the fees they charge between private users, small and medium-sized enterprises (SMEs) and start-ups on the one hand and larger corporations, which do not fall under the former definition, on the other. In this way, a possibility for differentiation among commercial users is created, whereby the fees have to be based on the costs of the infrastructure for providing the data. This is rather atypical in the European legal system, since the principle of equal treatment normally applies. Cultural heritage institutions thus have EU Commissioner for Competition Margrethe Vestager on their side, who presented the Data Governance Act in 2020 (and which, by the way, is applicable from 24 September 2023). Vestager is also Executive Vice President of the European Commission for a Europe Fit for the Digital Age and has imposed more than 15 billion euros in antitrust fines in her first five years in office. So the political will to enforce this seems to be there.

In case of doubt, this will be necessary. Licences like CC-BY-NC-SA effectively prevent the use of public data for commercial exploitation in large language models. But since the creators of large language models are already moving through a minefield regarding copyright – in the case of other models, a stock photo agency and other rights holders have already filed copyright lawsuits – one must unfortunately doubt that they will show consideration in the future. Of course, the relevant court decisions in the pending cases remain to be seen. Even with reverse engineering, it is not easy to prove which datasets have been incorporated into a large language model; a kind of circumstantial evidence would therefore have to be provided. In the medium and long term, it thus seems more sensible to focus on establishing validation processes and standards that have to be implemented prior to publishing AI models. This includes the disclosure of the training material and the training process, its evaluation by experts, code audits, but also a reversal of the burden of proof with regard to the licensing of the data material used. Making such procedures an obligatory part of the approval of commercial AI applications is then actually the task of the European Union.

Finally, another way is to publish cultural heritage data in a separate Data Space for Cultural Heritage; the tender for this Data Space was launched last autumn and is part of the European Union’s Data Act. To what extent this Data Space will grant full data sovereignty to cultural heritage institutions and thus the possibility to control access to data publications remains to be seen.

On the Use of ChatGPT in Cultural Heritage Institutions

Since the release of the ChatGPT dialogue system in November 2022, the societal debate about artificial intelligence (AI) has gained significant momentum and has also reached cultural heritage institutions (such as libraries, archives, and museums). The main challenge is to assess how powerful such large language models (LLMs) are in general, and Generative Pre-trained Transformers (GPTs) in particular. For the cultural heritage sector, the ChatGPT chatbot prototype reveals a whole range of possible uses: producing text summaries or descriptions of artworks, generating metadata, writing computer code for simple tasks, assisting with subject indexing and keyword indexing, or helping users find resources on the websites of cultural heritage institutions.

Undoubtedly, ChatGPT’s strengths lie in the generation of text and associated tasks. As “stochastic parrots,” as these large language models were called in a much-discussed 2021 paper, they can predict on a stochastic basis what the next words of a snippet of text will look like. In this context, ChatGPT has been trained – as a text-based dialogue system – to provide an answer in any case. This property of the chatbot points directly to one of the central weaknesses of the model: in case of doubt, ChatGPT provides untrue statements in order to keep the dialogue going. Since large language models are, after all, only applications of artificial intelligence and have no knowledge of the world, they cannot per se distinguish between fact and fiction, social construction and untruth. The fact that ChatGPT “hallucinates” (as the common anthropomorphizing term goes) when in doubt and, for example, invents literature references, of course damages the reliability of the system – and it points to the great strength of libraries in providing authoritative evidence.

On the other hand, a strength of such systems is that they can reproduce discourses exceptionally well and are therefore able to classify individual texts or larger text corpora and to describe their content in an outstanding way. This shows great potential, especially for libraries: up to now, digital assistants that support the indexing of books have at best worked with statistical methods such as tf-idf, or with deep learning. Such approaches could be complemented through the use of topic modeling. The latter method generates stochastically modelled collections of words that describe the content of a work or the topics it deals with. The challenge for users so far has been to interpret such a collection of words and assign a coherent label to it – and this is exactly what ChatGPT does excellently, as several researchers have confirmed. Since this massively improves and facilitates the labelling of texts, it is certainly one of the most probable use cases for AI in libraries, and exactly the field on which sub-project 3, “AI-supported content analysis and subject indexing”, of the project “Human.Machine.Culture” focuses. By contrast, ChatGPT’s performance on simple programming tasks, such as creating a bibliographic record in a specific format or transforming a record from MARC.xml to JSON, is still in need of improvement; it does not always perform such tasks reliably, as a recent experiment showed.
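
What such topic-model output looks like – the “collection of words” that then needs a label – can be illustrated with a minimal scikit-learn sketch on a toy corpus. This is purely illustrative; the actual pipeline of the sub-project may differ, and an LLM would only come in at the final step, turning each word list into a coherent subject label.

```python
# Minimal topic-modelling sketch with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "medieval manuscripts illuminated parchment scribes monastery",
    "steam engine railway industrial machinery iron coal",
    "monastery scribes parchment liturgy chant",
    "railway locomotives coal freight industrial expansion",
]

vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(term_matrix)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    # This word list is what a human (or an LLM) would label, e.g. "monastic book culture".
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```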

ChatGPT, as one of the most powerful text-based AI applications currently available, underlines the potential benefits of such models. At the same time, however, it also highlights the risks associated with the use of such applications: so far, only US-American big tech companies are able to train such powerful models, make them accessible, and subsequently develop models optimized through reinforcement learning for specific tasks – with the clear goal of monetization. In addition, generative AI systems bring with them a number of ethical issues, as they require large masses of text that have so far been taken from the Internet – a place where not all people interact politely and with due etiquette. For example, a recent study has underlined that large language models reproduce stereotypes by associating the terms “Muslims” and “violence”. Moreover, toxic content in the language models has to be labeled as such, an operation that is carried out by underpaid workers; this again underlines the ethical dubiousness of the process of establishing such models.

Finally, it has to be underlined that these models have been trained almost exclusively on 21st-century textual material available on the Internet. By contrast, sub-project 4, “Data provision and curation for AI”, of the project “Human.Machine.Culture” concentrates on the provision of curated and historical data from libraries for AI applications. Ultimately, the deployment of large language models points to very fundamental questions: namely, what role the cultural heritage of all humanity should play in the future and what effect cultural heritage institutions like libraries, archives and museums may have on the creation of such models; and what influence the texts generated by large language models will have on our contemporary culture as such.