
Orientation in Turbulent Times

Cultural heritage institutions such as galleries, libraries, archives and museums (GLAMs) currently find themselves in a difficult situation: generative AI models have fundamentally changed the meaning of the term “openness”. Until recently, the open provision of digital cultural heritage was an absolute ideal, as was the protection of intellectual property rights (IPR). Between this pair of opposites lies a grey area with many fine nuances, and guidelines offer orientation for navigating between the two poles in case of doubt. Openness should enable the creation of new culture on the basis of existing cultural heritage and stimulate innovation and research, ideally by providing material that is in the public domain. For copyrighted works, cultural heritage institutions can conclude licence agreements with publishing houses as the holders of the rights. Until now, cultural heritage institutions have therefore seen their role as that of access brokers, balancing creator-friendly copyright and accessibility.

The development of generative AI applications, especially in the 2020s, has complicated this situation considerably: What is the relationship between generative AI and intellectual property? Can such models be trained on copyrighted material? Can copyright holders refuse to allow their material to be used to train machine learning applications? Who owns the copyright to the output of these models? Can certain commercial organisations be excluded from using copyrighted material while other (commercial) users are allowed to do so? Cultural heritage institutions now have to navigate between the monsters Scylla (intellectual property protection) and Charybdis (restrictions for commercial companies). The fact that there are now two Messina lighthouses (openness for all, and the provision of cultural heritage data sets for innovation and research) does not make things any easier.

Karl Friedrich Schinkel, “Strait of Messina, Scylla and Charybdis”. Drawing. Public Domain, Kupferstichkabinett, Berlin State Museums

The previously existing pair of opposites, which often represented a dilemma (i.e. a situation in which every decision in favour of one of the options leads to an undesirable outcome), has now been replaced by four poles – with significantly more options for action: affirmation, negation, both, neither. This tetralemmatic situation is particularly striking for research libraries, as they hold a treasure that is becoming increasingly valuable: digitally available books with syntactically and lexically correct texts from trusted sources such as cultural heritage institutions or publishers have become a depletable and, in the near future, contested resource for the training of large language models. According to one study, high-quality text data in English will be exhausted before 2026, and the time horizon for other world languages is unlikely to be much longer. The stocks of public domain works that libraries are constantly digitising are therefore also increasing in value – ironically, including texts that are actually published in open access and for which the major publishing houses will secure usage rights in the near future in order to train their own models.

Libraries that have entered into licence agreements with publishers in order to make copyrighted works available in digital form have a problem if those agreements explicitly exclude the use of the protected content for training purposes. If an agreement does not yet contain such a clause, it is advisable, depending on the national context, to safeguard the rights holders’ claims anyway. The Royal Library of the Netherlands (KB) has therefore excluded commercial companies from downloading such resources, fearing that they would violate copyright law, and has updated its terms of use accordingly. This is unusual in that no distinction was previously made between different types of users. Legally, such an approach can be problematic if it prevents access to public domain material. Technically, blocking crawlers is only a stopgap, as it cannot reliably keep them away from the content provided (see the sketch below); legally, action must then also be taken against unauthorised use in the event of an infringement.

And finally: is it ethically correct to block commercial companies from certain content? After all, this also affects start-ups, small and medium-sized enterprises (SMEs) and companies in the creative sector. How can we legitimately differentiate between big tech companies and smaller players?
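
A minimal sketch of what such crawler blocking amounts to in practice – a user-agent filter in Python, with an illustrative (not exhaustive) list of agent strings – also shows why it remains a stopgap: a crawler can simply change the string it announces.

```python
# Known AI-training crawlers; the list is illustrative, not
# exhaustive, and has to be maintained by hand as new crawlers appear.
BLOCKED_AGENTS = ("GPTBot", "CCBot")

def is_blocked(user_agent: str) -> bool:
    """Crude user-agent filter, e.g. for a download endpoint.

    An emergency measure only: it works solely against well-behaved
    crawlers that announce themselves truthfully.
    """
    return any(token in user_agent for token in BLOCKED_AGENTS)
```

A crawler that spoofs a browser user agent passes this filter unchallenged, which is precisely why the KB’s approach must be backed by terms of use and, if necessary, legal action.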

It is not surprising that there is a lack of clarity about the legal framework: the law often lags behind reality. The AI Act, on which a compromise has been negotiated, is due to be passed and come into force this year. What will the regulations look like – and will they really provide clarity? Entities that develop AI applications and operate in the EU will be required to develop a “policy to respect Union copyright law”. The use of copyright-protected works for the training of AI models is linked to the text and data mining (TDM) exception in Article 4 of the “Directive on copyright in the Digital Single Market”. This allows AI models to be trained with copyrighted material. However, the directive also gives rights holders the possibility to reserve their rights in order to prevent text and data mining: “where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightholders if they want to carry out text and data mining over such works.” This is where it gets tricky: so far there is no standardised legal process for this, and it is unclear according to which (technical) standard or protocol the opt-out should be formulated in machine-readable form (one proposal under discussion is sketched below). It is therefore not surprising that even a non-profit organisation such as Creative Commons has called for the option to opt out of such use to become an enforceable right.
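
One of the proposals under discussion is the TDM Reservation Protocol (TDMRep) of a W3C Community Group, which signals a reservation via an HTTP response header and a site-wide well-known file. The following Python sketch assumes that draft: the header name and the path of the JSON file come from the TDMRep draft, everything else (the function, the prefix matching) is illustrative simplification.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def tdm_reserved(url: str) -> bool:
    """Check whether TDM rights have been reserved for a resource,
    following the W3C Community Group draft of the TDM Reservation
    Protocol (TDMRep). Two of the three signalling channels are
    checked here; the third (an HTML <meta> tag) is omitted.
    """
    # 1. Per-resource signal: the 'tdm-reservation' response header.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        if resp.headers.get("tdm-reservation") == "1":
            return True

    # 2. Site-wide signal: the well-known TDMRep file.
    parts = urllib.parse.urlsplit(url)
    well_known = f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"
    try:
        with urllib.request.urlopen(well_known) as resp:
            rules = json.load(resp)
    except (urllib.error.URLError, json.JSONDecodeError):
        return False  # no site-wide reservation found

    path = parts.path or "/"
    for rule in rules:
        # Simplification: the draft allows wildcard patterns in
        # 'location'; a plain prefix match is used here instead.
        if path.startswith(rule.get("location", "/")):
            return rule.get("tdm-reservation") == 1
    return False
```

Whether model providers will be obliged to perform such a check before ingesting a work is exactly the open question: as long as no protocol is standardised, a reservation expressed this way binds only those who choose to look for it.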

Against this background, it becomes clear that cultural heritage institutions must abandon the ideal of openness, at least where it is set in absolute terms. Rather, nuances are needed: open to private users and research, but – if the rights holders so wish – not to the cultural industry, to start-ups, to small and medium-sized enterprises or to commercial AI companies. In pragmatic terms, this initially means that numerous licence agreements will have to be renegotiated in order to clearly document the rights holders’ position. Nevertheless, many questions remain unanswered: What about the numerous works for which the rights of use have not been clarified? Is it possible to differentiate between SMEs and big tech companies, or does “NoAI” simply apply across the board? Shouldn’t there also be separate licences for this? Who is responsible for developing technical standards and protocols to implement the opt-out in a machine-readable way? And who is responsible for initiating the “machine unlearning” of models that have already been trained on copyright-protected works?

On the Use of Licences in Times of Large Language Models

It could all be so simple: cultural heritage institutions and other public sector bodies provide high-quality data on a large scale and, wherever possible, under a permissive licence such as CC0 or the Public Domain Mark 1.0. This is in line with the idea that cultural heritage institutions are funded by taxes and that everyone should therefore benefit from their services and products; in the case of data, this means enabling innovation, research and, of course, private use.

However, we live in times of large language models and exploitative practices, especially by US big tech companies. Data are extracted from the web on a large scale and processed into proprietary large language models. These companies are not only drivers of innovation; they also set themselves apart from research institutions, for example, by having specially prepared training data sets at their disposal, as well as exceptional computing power and the best-paid positions for algorithm developers – all of them expensive ingredients of a recipe for success in the face of limited competition.

One of the weaknesses of ChatGPT – and presumably of GPT-4 – is its lack of reliability. This weakness results from the inability of purely stochastic language models to distinguish between fact and fiction, but also from a lack of data. Especially with regard to “hallucinated” literature references, bibliographic data from libraries are very attractive for building large language models. Another problem is the scarcity of high-quality text data. According to a recently published study, high-quality text data will be exhausted before the year 2026; this is mainly due to the lack of etiquette and proper spelling on the internet. But who, if not the libraries, has huge stocks of high-quality text data? Almost all the content available there has passed through a quality filter called “publishing houses”. Opinions may be divided on the intellectual quality of the books; but linguistically and orthographically, everything that was printed until the end of the 20th century (i.e. before the advent of self-publishing) is of very good quality.

Finally, dear money: inflation is back, the low-interest phase is over, and a first Silicon Valley bank has gone bankrupt. Many companies based there will soon need fresh capital, and monetisation to generate profits will follow. Products that were previously offered free of charge (such as ChatGPT) will soon give way to new, more capable models that provide demand-driven services in exchange for payment.

Should cultural heritage institutions, as public entities, serve to maximise the profits of a few companies by providing expensively produced, resource-intensive (and tax-funded) data for free? The answer has to be differentiated and is therefore complicated. Of course, data should continue to be made available under permissive licences, as has been the case up to now. A dual strategy can certainly be pursued here. On the one hand, data made available via interfaces such as OAI-PMH or IIIF remain accessible under CC0 or the Public Domain Mark 1.0; technical access restrictions can prevent large-scale data extraction, e.g. by monitoring IP addresses or imposing download maxima (a sketch of such a limit follows below). On the other hand, specific data publications can be provided that bundle individual data sets to enable research and innovation; such offerings are protected as databases for 15 years, and licences containing an “NC” (non-commercial) element can be used here to make such data usable for research and innovation. As an example, the Prussian Cultural Heritage Foundation uses such a licence (CC-BY-NC-SA) for the digital representation of one of its masterpieces, and the (not so easy to handle) 3D scan is also freely available under this licence (download here).
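
What such a download maximum could look like in practice is sketched below: a sliding-window rate limiter per IP address in Python. The window size and request limit are illustrative values, not recommendations, and a production system would persist the counters rather than hold them in memory.

```python
import time
from collections import defaultdict

# Illustrative values, not recommendations: at most MAX_REQUESTS
# requests per client IP within a sliding window of WINDOW_SECONDS.
WINDOW_SECONDS = 3600
MAX_REQUESTS = 500

_request_log: dict[str, list[float]] = defaultdict(list)

def allow_request(client_ip: str) -> bool:
    """Sliding-window download maximum keyed by IP address.

    Returns False once a client exceeds the per-window limit, which
    throttles bulk extraction while leaving ordinary use untouched.
    """
    now = time.monotonic()
    recent = [t for t in _request_log[client_ip] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        _request_log[client_ip] = recent
        return False
    recent.append(now)
    _request_log[client_ip] = recent
    return True
```

A limit of this kind leaves the CC0 or Public Domain Mark status of the individual records untouched; it restricts only the rate of access, not the right of use.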

Interestingly, the European Union anticipated the case described above in the Data Governance Act and included a relevant set of instruments. There is a chapter on the re-use of data held by public sector bodies (Chapter II; Article 6 regulates the provision of data in exchange for fees). It states that public sector bodies may differentiate the fees they charge between private users, small and medium-sized enterprises (SMEs) and start-ups on the one hand and larger corporations, which do not fall under the former definitions, on the other. This creates a possibility for differentiation among commercial users, whereby the fees must be based on the costs of the infrastructure used to provide the data. This is rather atypical in the European legal system, where the principle of equal treatment applies. Cultural heritage institutions thus have EU Commissioner for Competition Margrethe Vestager on their side, who presented the Data Governance Act in 2020 (and which, by the way, has applied since 24 September 2023). Vestager is also Executive Vice-President of the European Commission for a Europe Fit for the Digital Age and imposed more than 15 billion euros in antitrust fines in her first five years in office. So the political will to enforce seems to be there.

In case of doubt, this will be necessary. Licences like CC-BY-NC-SA effectively prevent the use of public data for commercial exploitation in large language models. But since the creators of large language models are moving through a minefield where copyright is concerned – in the case of other models, a stock photo agency and other rights holders have already filed copyright lawsuits – one must unfortunately doubt that they will show consideration in the future. Of course, the relevant court decisions in the pending cases remain to be seen. Even with reverse engineering, it is not easy to prove which data sets have been incorporated into a large language model; a kind of circumstantial evidence would therefore have to be provided. In the medium and long term, it seems more sensible to focus on establishing validation processes and standards that have to be completed before AI models are published. This includes the disclosure of the training material and the training process, its evaluation by experts and code audits, but also a reversal of the burden of proof with regard to the licensing of the data material used. Making such procedures an obligatory part of the approval of commercial AI applications is then, in fact, a task for the European Union.

Finally, another way is to publish cultural heritage data in a separate Data Space for Cultural Heritage; the tender for this Data Space was launched last autumn and is part of the European Union’s Data Act. To what extent this Data Space will grant cultural heritage institutions full data sovereignty – and thus the possibility to control access to data publications – remains to be seen.