
Openness, Efficiency and Closed Infrastructures

The concept of data spaces that the European Commission is pursuing is not only a technical one; it also implies a political constitution. Data spaces such as GAIA-X do not require centralised management. The operation of such a data space can take place within a federation that establishes the means to control data integrity and data trustworthiness. The federation that operates the data space is therefore more like the European Union (i.e. a federation of states) than like a centralised democracy. And trust is not something that characterises cultural heritage institutions only with regard to data and machine learning models. Such institutions fulfil their mission on the basis of the trust that people place in them, a trust that has grown over decades or centuries and expresses people’s conviction that these renowned and time-honoured institutions make the right decisions, for example when acquiring their objects.

The political concept of data spaces thus stands in clear contrast to the hierarchical and opaque structures of big tech companies. With regard to data and machine learning models, a clear centralisation movement can be observed in the relevant corporations (Alphabet, Meta, Amazon, Microsoft) since the 2010s, particularly in research and development and in the provision of infrastructure. A study published in 2022 on the values that are central to machine learning research emphasises two insights. First, the presence of large tech companies in the 100 most-cited studies published at the two most influential machine learning conferences is increasing massively: “For example, in 2008/09, 24% of these top cited papers had corporate affiliated authors, and in 2018/19 this statistic more than doubled, to 55%. Moreover, of these corporations connected to influential papers, the presence of “big-tech” firms, such as Google and Microsoft, more than tripled from 21% to 66%.” This means that tech companies are involved in the most important research almost as frequently as the most important universities. The consequences of this privatisation of research for the distribution of knowledge production in Western societies would be worth studies of their own. Second, the study by Birhane et al. highlights a value that is repeatedly emphasised in the 100 examined research articles: efficiency. This praise of efficiency is not neutral, as it favours those institutions that are able to process constantly growing amounts of data and to procure and deploy the necessary resources. In other words, emphasising a technical-sounding value such as efficiency “facilitates and encourages the most powerful actors to scale up their computation to ever higher orders of magnitude, making their models even less accessible to those without resources to use them and decreasing the ability to compete with them.”


Feigned door of Sokarhotep, Old Kingdom, 5th Dynasty. Ägyptisches Museum und Papyrussammlung. CC BY-SA 4.0.
The feigned door of Sokarhotep symbolises the feigned openness of AI applications provided by big tech

This already addresses the second aspect: control over infrastructure. There is no doubt that a “compute divide” already exists between the big tech companies and, for example, elite universities. Research and development in the field of machine learning is currently highly dependent on the infrastructure provided by a small number of actors. This situation also has an impact on the open provision of models. When openness becomes a question of access to resources, scale becomes a problem for openness: truly open AI systems are not possible if the resources needed to build them from scratch and deploy them on a large scale remain closed, because they are only available to those who have these significant resources at their disposal. And these are almost always corporations. A recently published study on the concentration of power and the political economy of open AI therefore concludes that open source and centralisation are mutually exclusive: “only a few large tech corporations can create and deploy large AI systems at scale, from start to finish – a far cry from the decentralized and modifiable infrastructure that once animated the dream of the free/open source software movement”. A company name like “OpenAI” thus becomes an oxymoron.

Against this backdrop, it becomes clear that the European concept of data spaces represents a counter-movement to the monopolistic structures of tech companies. The openness, data sovereignty and trustworthiness that these data spaces represent will not open up the possibility of building infrastructures that can compete with those of the big tech companies. However, they will make it possible to develop specific models with clearly defined tasks that work more efficiently than the general-purpose applications developed by the tech companies. In this way, the value of efficiency, which is central to the field of machine learning, could be recoded.

Openness, and Some of Its Shades

Openness, that lighthouse of the 20th century, came along with open interfaces (APIs). In the case of galleries, libraries, archives, and museums (GLAMs), it was the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH for short. At the time, the idea was to provide an interface that makes metadata available in interoperable formats and thus enables exchange between different institutions. The protocol also enables the harvesting of distributed resources described in XML format, which may be restricted to named sets defined by the provider. The objects are referenced via URLs in the metadata, which also facilitates access to the objects themselves. Essentially, the protocol is not designed to differentiate between users; licences and rights statements can be included, but masking specific material from access was not foreseen: the decision whether (and which) use is made of material protected by intellectual property rights ultimately lies with the users.
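Technically, harvesting via OAI-PMH amounts to little more than paginated HTTP requests against a repository’s base URL. A minimal sketch of the workflow described above – the verbs, parameters and namespaces are those of the OAI-PMH specification, while the endpoint URL and set name are hypothetical:

```python
import requests
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH specification and Dublin Core
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest(base_url, set_spec=None):
    """Harvest Dublin Core records, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec  # restrict to a named set defined by the provider
    while True:
        response = requests.get(base_url, params=params, timeout=30)
        root = ET.fromstring(response.content)
        for record in root.iter(f"{{{NS['oai']}}}record"):
            # dc:identifier typically carries the URL referencing the object itself
            for identifier in record.iter(f"{{{NS['dc']}}}identifier"):
                yield identifier.text
        # OAI-PMH paginates large result sets via resumption tokens
        token = root.find(f".//{{{NS['oai']}}}resumptionToken")
        if token is None or not token.text:
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}

# Hypothetical endpoint; any OAI-PMH-compliant repository is queried the same way:
# for url in harvest("https://oai.example-library.org/oai2", set_spec="maps"):
#     print(url)
```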


Lighthouse on the Breton coast, painting by Théodore Gudin, 1845. Staatliche Museen zu Berlin, Nationalgalerie. Public Domain Mark 1.0

The 21st century brought a new concept: data sovereignty. On the one hand, this implies that data are subject to the laws and governance structures of the jurisdiction in which they are hosted; on the other hand, for the hosts, the concept stands for the notion that rights holders can determine for themselves what third parties may and can do with the data. Given that there is now a second lighthouse – the provision of cultural heritage data sets for innovation and research – offering orientation in troubled times, the role of cultural heritage institutions as access brokers becomes tangible: if rights holders do not wish to provide their (IPR-protected) data openly to commercial AI companies, GLAM institutions as data providers are in a position to negotiate differentiations in the use of these data. For example, the data may be used freely by start-ups, small and medium-sized enterprises (SMEs) and companies active in the cultural sector, while for big tech their use could involve fees. Interestingly, the European Data Governance Act provides for such a case and includes a relevant set of instruments. Its chapter on the re-use of data held by public sector bodies (Chapter II, Article 6) regulates the provision of data in exchange for fees and allows the fees to be differentiated between private users, SMEs and start-ups on the one hand, and larger corporations, which do not fall under the former categories, on the other. In this way, a possibility for differentiation among commercial users is created, whereby the fees have to be oriented at the costs of the infrastructure used to provide the data. For these cases, cultural heritage institutions need new licences (or rights statements) clarifying whether commercial enterprises are excluded from access to data on the basis of the rights holders’ opt-out, and whether big tech corporations can obtain access by paying fees while data are provided free of charge to start-ups and SMEs.
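Read as a decision rule, the differentiation permitted by Article 6 could be sketched as follows; the user categories mirror those named above, while the cost figure is invented for illustration – the Act only requires that fees be derived from the actual costs of provision:

```python
from enum import Enum, auto

class UserCategory(Enum):
    PRIVATE = auto()
    RESEARCH = auto()
    SME_OR_STARTUP = auto()
    LARGE_CORPORATION = auto()

# Hypothetical cost basis: fees must be oriented at the costs of the
# infrastructure used to provide the data, not at market value.
INFRASTRUCTURE_COST_PER_GB = 0.02  # EUR, invented for illustration

def access_fee(category: UserCategory, volume_gb: float) -> float:
    """Return a (hypothetical) fee differentiated by user category."""
    cost_based_fee = INFRASTRUCTURE_COST_PER_GB * volume_gb
    if category in (UserCategory.PRIVATE, UserCategory.RESEARCH,
                    UserCategory.SME_OR_STARTUP):
        return 0.0            # provided free of charge
    return cost_based_fee     # larger corporations contribute to the costs
```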

While this describes the legal side of the role of GLAM institutions as access brokers, there is also a technical side to data sovereignty, addressed by “data spaces”. APIs like OAI-PMH will continue to ensure the exchange between institutions, but will lose importance in terms of data provision for third parties (apart from the provision of material that is in the public domain). By contrast, the concept of data spaces, which is of central importance for the European Commission’s policy for the upcoming years, will gain in importance. One planned data space is, for example, the European Data Space for Cultural Heritage, which is to be created in collaboration with Europeana; existing similar initiatives include the European Open Science Cloud (EOSC) and the European Collaborative Cloud for Cultural Heritage (ECCCH). A technical implementation of such a data space is GAIA-X, a European initiative for an independent cloud infrastructure. Among other functionalities, it enables GLAM institutions to keep their data on premises while delivering processed data to users of the infrastructure, after an algorithm of the users’ choice has been applied to the data held by the cultural heritage institution: instead of downloading terabytes of data and processing them on their own, users select an algorithm (or machine learning model) that is sent to the data. An example providing such functionalities has been developed by the Berlin State Library with the CrossAsia Demonstrator. Such an infrastructure not only enables the handling of data with various rights of use, but also allows a differentiation between users as well as payment services. In other words, it grants full sovereignty over the data. As with all technical solutions, there is a downside: such data spaces are usually complex and difficult to manage, which constitutes an obstacle for cultural heritage institutions and often results in the need for additional staff.
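The principle of sending the algorithm to the data can be illustrated with a brief sketch; the endpoint paths, job format and identifiers are hypothetical and do not reproduce the actual GAIA-X or CrossAsia interfaces:

```python
import time
import requests

DATA_SPACE = "https://dataspace.example.org"  # hypothetical connector endpoint

def run_model_on_premise(dataset_id: str, model_id: str) -> dict:
    """Submit a processing job that runs where the data is hosted.

    Only the processed results leave the institution; the raw data
    (and any usage restrictions attached to it) stay on premises.
    """
    job = requests.post(
        f"{DATA_SPACE}/jobs",
        json={"dataset": dataset_id, "model": model_id},
        timeout=30,
    ).json()

    # Poll until the data holder has applied the model and released the results
    while True:
        status = requests.get(f"{DATA_SPACE}/jobs/{job['id']}", timeout=30).json()
        if status["state"] == "finished":
            return status["results"]
        time.sleep(10)

# Hypothetical usage:
# results = run_model_on_premise("newspapers-1912", "ner-model-v2")
```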

Linked (but not bound) to the concepts of data spaces and data sovereignty is the idea of a commons. “Commons” designates a shared resource that is managed by a community for the benefit of its members. Europeana, the meta-aggregator and web portal for the digital collections of European cultural heritage, explicitly conceptualises the planned European Data Space for Cultural Heritage as “an open and resilient commons for the users of European cultural data, where data owners – as opposed to platforms – have control of their data and of how, when and with whom it is shared”. The formulation chosen here is indicative of a learning process with regard to openness: defining an open commons “as opposed to platforms” addresses an issue that is characteristic of open commons, namely the over-use of the available resources, which may lead to their depletion. In the classical examples of commons, such as fishing grounds or pasture, the resource is endangered if users try to profit from it without at the same time contributing to its preservation. With digital resources, however, the resource itself cannot be depleted; rather, the issue lies in the potential loss of communal benefits due to actions motivated by self-interest. In the 21st century, the rise of the big platforms has revealed what has been termed “the paradox of open”: “open resources are most likely to contribute to the power of those with the best means to make use of them”. The need for data spaces managed by a community for the benefit of its members does not only add another shade to openness; at the same time, it opens up another front – the turn against platformisation implies a rejection of the dominance of non-European big tech companies.

Orientation in Turbulent Times

Cultural heritage institutions such as galleries, libraries, archives and museums (GLAMs) currently find themselves in a difficult situation: generative AI models have fundamentally changed the meaning of the term “openness”. Until recently, the open provision of digital cultural heritage was an absolute ideal, as was the protection of intellectual property rights (IPR). Between this pair of opposites lies a grey area with many fine nuances, and guidelines offer orientation for navigating between the opposites in cases of doubt. Openness should enable the creation of new culture on the basis of existing cultural heritage and stimulate innovation and research, ideally by providing material that is in the public domain. Cultural heritage institutions can conclude licence agreements with publishing houses as the holders of copyrights. Until now, cultural heritage institutions have therefore seen their role as access brokers, balancing creator-friendly copyright and accessibility.

The development of generative AI applications, especially in the 2020s, has significantly complicated this situation: What is the relationship between generative AI and intellectual property? Can such models be trained with copyrighted material? Can copyright holders refuse to allow their material to be used to train machine learning applications? Who owns the copyright to the output of these models? Can certain commercial organisations be excluded from using copyrighted material while other (commercial) users are allowed to do so? Cultural heritage institutions now have to navigate between the monsters Scylla (intellectual property protection) and Charybdis (restrictions for commercial companies). The fact that there are now two lighthouses at Messina (openness for all, and the provision of cultural heritage data sets for innovation and research) does not make things any easier.

Karl Friedrich Schinkel, “Strait of Messina, Scylla and Charybdis”. Kupferstichkabinett, Staatliche Museen zu Berlin. Public Domain.

The previously existing pair of oppositions, which often represented a dilemma (i.e. a situation in which every decision in favour of one of the oppositions leads to an undesirable outcome), is now replaced by four poles – with significantly more options for action: affirmation, negation, both, neither. This tetralemmatic situation is particularly striking for research libraries, as they hold a treasure that is becoming increasingly valuable: digitally available books with syntactically and lexically correct texts from trusted sources such as cultural heritage institutions or publishers have become a depletable and, in the near future, contested resource for the training of Large Language Models. According to one study, high-quality text data in English will be exhausted before 2026, and the time horizon for other world languages is unlikely to be much longer. The stocks of public domain works that are constantly being digitised by libraries are therefore also currently increasing in value – ironically, however, this includes texts that are published in open access and for which the major publishing houses will secure usage rights in the near future in order to be able to train their own models. Libraries that have entered into licence agreements with publishers in order to make copyrighted works available in digital form have a problem if the licence agreements explicitly exclude the use of protected content for training purposes. If there is no statement to this effect yet, it is advisable, depending on the national context, to protect the claims of the rights holders. The Royal Library of the Netherlands (KB) has therefore excluded commercial companies from downloading such resources, fearing that such companies would violate copyright law, and has updated its terms of use accordingly. This is unusual in that previously no distinction was made between different users. Legally, such an approach can be problematic if it prevents access to public domain material. Technically, blocking crawlers is only an emergency solution, as crawlers cannot be blocked effectively from the content provided; legally, action must also be taken against unauthorised use in the event of an infringement. And finally: is it ethically correct to block commercial companies from certain content? After all, this also affects start-ups, small and medium-sized enterprises (SMEs) and companies in the creative sector. How can we legitimately differentiate between big tech companies and smaller players?
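What such an emergency solution looks like in practice: a server refuses requests from self-identifying AI crawlers based on their user agent. The following sketch uses Flask purely for illustration; the list of user agents is indicative (GPTBot and CCBot are commonly published crawler names), and the approach fails as soon as a crawler spoofs its user agent – which is exactly why it remains an emergency solution:

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Known AI/TDM crawler user agents (illustrative list; such lists require
# constant maintenance and can be evaded by spoofing the user agent).
BLOCKED_AGENTS = {"GPTBot", "CCBot"}

@app.before_request
def block_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(bot.lower() in ua.lower() for bot in BLOCKED_AGENTS):
        abort(403)  # refuse delivery; well-behaved crawlers also honour robots.txt

@app.route("/objects/<object_id>")
def serve_object(object_id):
    # Placeholder for the actual delivery of digitised material
    return f"metadata for {object_id}"
```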

It is not surprising that there is a lack of clarity about the legal framework: the law often lags behind reality. The AI Act, whose negotiation ended in a compromise, is due to be passed and to come into force this year. What will the regulations look like here – and will they really provide clarity? Entities that develop AI applications and operate in the EU will be required to develop a “policy to respect Union copyright law”. The use of copyright-protected works for the training of AI models is linked to the text and data mining (TDM) exception in Article 4 of the “Directive on copyright in the Digital Single Market”. This allows AI models to be trained with copyrighted material. However, the cited directive also provides rights holders with the possibility to reserve their rights in order to prevent text and data mining; “where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightholders if they want to carry out text and data mining over such works.” This is where it gets tricky: so far, there is no standardised legal process for this, and it is unclear which (technical) standard or protocol should be used to express the opt-out in machine-readable form. It is therefore not surprising that even a non-profit organisation such as Creative Commons has called for the option to opt out of such use to become an enforceable right.
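One candidate for such a machine-readable signal is the TDM Reservation Protocol (TDMRep), drafted by a W3C Community Group, which expresses the reservation either as an HTTP response header or as a site-wide well-known file. A minimal client-side check could look as follows – a sketch, assuming the draft’s header and file names, with a hypothetical example URL:

```python
import requests

def tdm_reserved(url: str) -> bool:
    """Check whether a web resource signals a TDM rights reservation."""
    # Variant 1: per-resource HTTP response header defined by TDMRep
    head = requests.head(url, timeout=10, allow_redirects=True)
    if head.headers.get("tdm-reservation") == "1":
        return True

    # Variant 2: site-wide declaration at /.well-known/tdmrep.json,
    # a list of rules such as {"location": "/*", "tdm-reservation": 1}
    scheme, _, host = url.split("/", 3)[:3]
    resp = requests.get(f"{scheme}//{host}/.well-known/tdmrep.json", timeout=10)
    if resp.ok:
        for rule in resp.json():
            if rule.get("tdm-reservation") == 1:
                return True
    return False

# Hypothetical usage: skip works whose rights holders have opted out
# if tdm_reserved("https://digital.example-library.org/objects/123"):
#     print("TDM reserved - authorisation required before mining")
```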

Against this background, it becomes clear that cultural heritage institutions must abandon the ideal of openness, at least if it is set in absolute terms. Rather, nuances need to be added: open to private users and research, but – if the rights holders so wish – not to the cultural industry, to start-ups and small and medium-sized enterprises, or to commercial AI companies. In pragmatic terms, this initially means that numerous licence agreements will have to be renegotiated in order to clearly document the rights holders’ position. Nevertheless, many questions remain unanswered: What about the numerous works for which the rights of use have not been clarified? Is it possible to differentiate between SMEs and big tech companies, or does “NoAI” simply apply across the board? Shouldn’t there also be separate licences for this? Who is responsible for developing technical standards and protocols to implement the opt-out in a machine-readable way? And who is responsible for initiating the “machine unlearning” of models that have already been trained on copyright-protected works?