AI ‘opt-outs’: should cultural heritage institutions (dis)allow the mining of cultural heritage data?

The text and data mining provisions

In 2019, the Copyright in the Digital Single Market Directive made it possible for anyone to copy and extract large amounts of copyright-protected data to which they have legal access, without permission from the rightsholder, in order to carry out data mining. This is possible unless the rightsholder expressly chooses, through machine-readable means, to ‘opt out’ the copyright-protected data from being mined. The opt-out possibility does not apply to data mining carried out by cultural heritage and research institutions for research purposes. At the time, text and data mining was already established in other parts of the world, and the lack of legal clarity put the European Union at a competitive disadvantage.

These provisions are there to ensure that copyright does not stand in the way of the opportunities that the analysis of large amounts of data brings to the research and cultural heritage sectors in the European Union (by substantially improving the analysis and discoverability of information) and for the information society at large.

Blocking data mining from cultural heritage data

In 2019, cultural heritage institutions, advocating for democratic access to information, were in favour of the text and data mining exceptions. It was therefore unexpected that these same institutions would consider making use of the opt-out possibility to block the mining of copyright-protected cultural heritage data.

Opting out of this type of processing has recently prompted discussion in the cultural heritage sector. The National Library of the Netherlands, for example, added wording to its terms and conditions that prohibits all commercial generative AI systems from mining the copyright-protected works of the library. Via machine-readable methods, it explicitly forbids ChatGPT from harvesting its collections.
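A machine-readable opt-out of this kind is often expressed in a site's robots.txt file. The sketch below is illustrative only: the rules are hypothetical and not taken from any institution's actual configuration, `GPTBot` is assumed here to be the user-agent token of OpenAI's crawler, and `may_crawl` is a helper name invented for this example. It uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler might interpret such rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.org and ResearchBot are placeholders, not real services.
    item_url = "https://example.org/collection/item-1"
    for agent in ("GPTBot", "ResearchBot"):
        print(agent, may_crawl(ROBOTS_TXT, agent, item_url))
```

Rules like these apply to a whole path rather than to the rights status of individual items, and they only bind crawlers that choose to honour them.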

In certain cases, the reason for implementing an opt-out seems to be that rightsholders ask for it as a condition for data to be shared through a cultural heritage organisation’s website. The request sometimes comes from an individual rightsholder, and sometimes from a collective management organisation, such as Pictoright in the Netherlands and Sacem in France. But sometimes the initiative seems to come from the cultural heritage institution itself, wanting to ensure that creators are respected through a transparent (attributed) and permission-based use of their creations.

One of the main arguments is that opt-outs are needed to stop certain ‘big tech’ companies working with generative AI from mining the data. Indeed, some large for-profit companies analyse large amounts of copyright-protected data without much transparency. They have been criticised for feeding on the ‘commons’ (content available free of copyright restrictions) without contributing back to them, while reinforcing their competitive advantage.

Beyond what is legally possible: what should the heritage sector stand for?

In most cases, cultural heritage institutions will give access to materials that are either not copyright-protected, or that are protected and for which the rightsholders have authorised the posting online, but for which the cultural heritage institution does not hold the copyright. In such cases, cultural heritage institutions are not entitled to make the decision to apply a data mining opt-out. They can only do so if copyright exists, and they hold the copyright.

But even if they do, it is worth asking whether opting out supports their objectives. In a way, blocking the possibility to use cultural heritage data seems to run counter to the mission of publicly funded cultural heritage institutions. Isn’t contributing trustworthy, high-quality information and fighting misinformation and bias (in algorithms) more in line with their objectives?

When it comes to correcting the bad practice of some big players in the AI world, would opting out cultural heritage data actually weaken them? Big tech companies can take legal risks, pay a fine, or pay the price for legally mining the data. Excluding cultural heritage data will not stop them from using it. It is more likely to have a negative impact on the SMEs, journalists, cultural heritage professionals and researchers who use the data, and on the tools built with it, both for research and for more general purposes. It risks weakening those who need the commons the most. The boundaries between commercial and research uses are increasingly blurred. Where do we draw the line?

Should cultural heritage institutions level the playing field and safeguard open access to cultural content for everyone, including machines? If no opt-out solutions are available or used that can be applied on an item-by-item basis, there is a clear risk that a machine-readable opt-out will spill over onto public domain material that is made available online.
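To illustrate what an item-by-item approach could look like, here is a minimal sketch under stated assumptions. The `Item` fields and the `tdm_headers` function are hypothetical names invented for this example, and the `tdm-reservation` HTTP header is used on the assumption that it follows the W3C community TDM Reservation Protocol (value `1` for rights reserved, `0` for not reserved). The rule it encodes is the one discussed above: an opt-out is only signalled when copyright exists and the institution itself holds it, so public domain items are never swept in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical per-item rights metadata held by an institution."""
    identifier: str
    in_copyright: bool
    institution_holds_rights: bool


def tdm_headers(item: Item) -> dict[str, str]:
    """Build per-item response headers signalling a TDM reservation, if any.

    Assumption: 'tdm-reservation' follows the TDM Reservation Protocol,
    where "1" means rights are reserved and "0" means they are not.
    """
    reserve = item.in_copyright and item.institution_holds_rights
    return {"tdm-reservation": "1" if reserve else "0"}


if __name__ == "__main__":
    items = [
        Item("public-domain-print", in_copyright=False, institution_holds_rights=False),
        Item("third-party-photo", in_copyright=True, institution_holds_rights=False),
        Item("own-catalogue-text", in_copyright=True, institution_holds_rights=True),
    ]
    for item in items:
        print(item.identifier, tdm_headers(item))
```

A site-wide rule cannot make this distinction, which is where the risk of spilling over onto public domain material comes from.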

The case of out-of-commerce works

The copyright directive mentioned above also introduced the out-of-commerce works system: a new legal solution through which cultural heritage institutions can share online, without permission from the copyright holder, materials in their collections that are not (or no longer) in commercial circulation, even though they are subject to copyright protection. This new system removes the (impossible) burden of clearing copyright in large collections.

This generally requires obtaining a licence from a collective management organisation, one that is representative of the types of materials in question. Through the Directive, these organisations are entitled to give ‘extended’ collective licences: they can authorise cultural heritage institutions to use materials that are part of the collective management organisation’s repertoire, but also materials that are not.

Some collective management organisations are including an obligation to ‘opt out’ these out-of-commerce works from being mined when they are shared online by the cultural heritage institution. In the context of ‘extended’ collective licensing, this is both practically and legally problematic. Practically, because it limits the reuse possibilities of the material and places an additional burden on the cultural heritage institution. Legally, because it is debatable whether a collective management organisation granting an ‘extended’ collective licence is the rightsholder entitled to exercise a data mining opt-out.

Next steps

We in the Copyright Community will continue to follow developments in this area closely. Stay tuned by joining our Community through the Europeana Network Association and following us on social media. If you wish to share any feedback on this topic with us, please reach out to copyright@europeana.eu.

You can read more about text and data mining on copyrightuser.org and on the Communia CDSM Directive transposition portal.